The Efficacy of Body-Mass Index and Child Mortality Rate In Indicating the Causes of Mortality In a Country

By Cao Yueran

Introduction

Previous studies on years of lives lost have shown that in general, the lives of the citizens of more developed countries are cut short due to degenerative diseases such as stroke and heart disease, while for developing countries, infectious disease still remains a leading cause of death. However, previous measurements of "development" use GDP per capita or qualitative analysis which does not take into account the inequality of a country. This projects attempts to find a correlation between the causes of mortality and two indexes that are relatively resistant to inequality, BMI and CMR.

Specific causes of mortality is generally a very difficult type of census data to gather, especially for less financially capable countries. These countries also tend to be the ones most in need of accurate data as to why their citizens are dying young. If BMI and CMR, two widely availabe indexes could accurately indicate a country's developmental status in regards to healthcare (as reflected in causes of mortality), then countries would be able to use them as a basis for targeted healthcare policies instead of causes of mortality.

Methodology

Mortality data is acquired from https://apps.who.int/healthinfo/statistics/mortality/whodpms/. The website is very difficult for a machine to navigate, so 16 .xml files were chosen and downloaded manually, each pertaining to a general cause of death. The .xml files were subsequently converted to .csv files for easy import. Each .xml contains Age-Standardized Death Rates (ASDRs) by country, by year (1979-2016).

The WHO world standard population were used to calibrate the mortality data. The ASDRs were calculated by WHO, so it is using the same formula as the one used in the BMI dataset (also by WHO). This should make the results a bit more indicative.

BMI Data

The WHO BMI dataset contains 25000 datapoints of the mean male, female and combined BMI of each country, per year (1975-2016). I am only interested in the combined BMI, for the ASDR are combined as well. Within each data cell, it also has the uncertainty bounds, but I will only be using the mean BMI.

CMR Data

The UNICEF child mortality dataset contains 20000 datapoints of median child mortality rate by country, by year (1950-2019). It contains both sexes, male and female statistics, but once again we are only interested in both sexes.

Miscellaneous

Data Wrangling

The index for diseases is available locally.

In [1]:
import numpy as np, pandas as pd
import requests
from matplotlib import pyplot as plt
import seaborn as sns
from IPython.display import display, HTML

pd.options.display.max_rows = 20
pd.options.mode.chained_assignment = None

disease_codes = pd.read_csv("disease_codes.csv", dtype={"disease": str})
disease_codes.index = np.arange(1, len(disease_codes) + 1)
disease_codes
Out[1]:
disease name
1 1001 Certain infectious and parasitic diseases
2 1026 Neoplasms
3 1048 Diseases of the blood and blood-forming organs...
4 1051 Endocrine, nutritional and metabolic diseases
5 1055 Mental and behavioural disorders
6 1058 Diseases of the nervous system
7 1062 Diseases of the eye and adnexa
8 1063 Diseases of the ear and mastoid process
9 1064 Diseases of the circulatory system
10 1072 Diseases of the respiratory system
11 1078 Diseases of the digestive system
12 1082 Diseases of the skin and subcutaneous tissue
13 1083 Diseases of the musculoskeletal system and con...
14 1084 Diseases of the genitourinary system
15 1087 Pregnancy, childbirth and the puerperium
16 1093 Congenital malformations, deformations and chr...

16 .xml files containing ASDRs will now be imported and merged into one DataFrame.

In [2]:
forConcat = []
for x in range(1, len(disease_codes) + 1):
    temp = pd.read_csv(str(x) + ".csv", header=1)
    temp.insert(1, "Cause", x)
    forConcat.append(temp)

mdf = pd.concat(forConcat, ignore_index=True)
mdf = mdf.loc[mdf.Countries != "Total reporting countries"]
mdf.rename({"Countries": "Country"}, axis=1, inplace=True)
mdf.reset_index(inplace=True, drop=True)
mdf.replace(" ", np.nan, inplace=True)
mdf
Out[2]:
Country Cause 2016 2015 2014 2013 2012 2011 2010 2009 ... 1988 1987 1986 1985 1984 1983 1982 1981 1980 1979
0 Albania 1 NaN NaN NaN NaN NaN NaN 1.2 1.1 ... 8.9 11.4 NaN NaN NaN NaN NaN NaN NaN NaN
1 Anguilla 1 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Antigua and Barbuda 1 NaN 28.4 39.5 29.8 38.3 NaN NaN 21.6 ... 15.8 10.2 20.6 11.7 NaN NaN NaN NaN NaN NaN
3 Argentina 1 NaN 23.9 26.0 26.3 25.6 27.0 27.6 27.2 ... 25.6 26.5 24.7 23.5 25.3 25.3 23.3 27.7 28.0 29.8
4 Armenia 1 7.1 9.0 8.4 7.4 8.5 8.2 9.8 9.2 ... 11.7 15.2 12.8 13.7 NaN NaN 12.6 13.6 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2251 Uruguay 16 NaN 5.9 5.2 6.3 6.4 NaN 5.1 5.3 ... 8.6 9.0 8.2 8.8 9.1 9.1 8.4 8.2 10.2 NaN
2252 Uzbekistan 16 NaN NaN 2.4 1.7 1.5 1.3 1.3 1.7 ... 5.0 6.5 7.4 7.4 NaN NaN 6.7 6.2 NaN NaN
2253 Venezuela (Bolivarian Republic of) 16 NaN NaN NaN 7.5 7.6 7.1 7.5 6.7 ... 6.9 6.4 6.4 6.0 NaN 6.8 6.7 6.3 6.0 5.6
2254 Virgin Islands (USA) 16 NaN 0.8 NaN NaN 2.4 3.5 4.1 1.1 ... NaN NaN NaN NaN NaN NaN NaN NaN 8.7 NaN
2255 Zimbabwe 16 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2256 rows × 40 columns

Let us proceed to map the names of causes onto Cause.

In [3]:
mdf.insert(
    2, "CauseName", mdf.Cause.map(dict(zip(disease_codes.index, disease_codes.name)))
)
mdf.sort_values(by=["Country", "Cause"], inplace=True)
mdf.reset_index(inplace=True, drop=True)
mdf
Out[3]:
Country Cause CauseName 2016 2015 2014 2013 2012 2011 2010 ... 1988 1987 1986 1985 1984 1983 1982 1981 1980 1979
0 Albania 1 Certain infectious and parasitic diseases NaN NaN NaN NaN NaN NaN 1.2 ... 8.9 11.4 NaN NaN NaN NaN NaN NaN NaN NaN
1 Albania 2 Neoplasms NaN NaN NaN NaN NaN NaN 56.4 ... 106.9 95.8 NaN NaN NaN NaN NaN NaN NaN NaN
2 Albania 3 Diseases of the blood and blood-forming organs... NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Albania 4 Endocrine, nutritional and metabolic diseases NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Albania 5 Mental and behavioural disorders NaN NaN NaN NaN NaN NaN 1.7 ... 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2251 Zimbabwe 12 Diseases of the skin and subcutaneous tissue NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2252 Zimbabwe 13 Diseases of the musculoskeletal system and con... NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2253 Zimbabwe 14 Diseases of the genitourinary system NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2254 Zimbabwe 15 Pregnancy, childbirth and the puerperium NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2255 Zimbabwe 16 Congenital malformations, deformations and chr... NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2256 rows × 41 columns

Let us remove the countries with no datapoints from 1979 until 2016. They are of no use to our analysis. The other missing data points would be left in for now, as it may come in handy during the plotting of a choropleth map to show which countries' infrastructure are so bad that they have no mortality data.

In [4]:
missingPerCountry = (
    mdf.groupby("Country")
    .apply(lambda x: x.notna().sum())
    .sum(axis=1)
    .sort_values(ascending=False)
)
noDataCountries = list(missingPerCountry[missingPerCountry <= 48].index)
mdf.drop(mdf[mdf.Country.isin(noDataCountries)].index, inplace=True)
mdf.reset_index(inplace=True, drop=True)
mdf
Out[4]:
Country Cause CauseName 2016 2015 2014 2013 2012 2011 2010 ... 1988 1987 1986 1985 1984 1983 1982 1981 1980 1979
0 Albania 1 Certain infectious and parasitic diseases NaN NaN NaN NaN NaN NaN 1.2 ... 8.9 11.4 NaN NaN NaN NaN NaN NaN NaN NaN
1 Albania 2 Neoplasms NaN NaN NaN NaN NaN NaN 56.4 ... 106.9 95.8 NaN NaN NaN NaN NaN NaN NaN NaN
2 Albania 3 Diseases of the blood and blood-forming organs... NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Albania 4 Endocrine, nutritional and metabolic diseases NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Albania 5 Mental and behavioural disorders NaN NaN NaN NaN NaN NaN 1.7 ... 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1787 Virgin Islands (USA) 12 Diseases of the skin and subcutaneous tissue NaN 1.0 NaN NaN 0.6 1.8 4.1 ... NaN NaN NaN NaN NaN NaN NaN NaN 3.0 NaN
1788 Virgin Islands (USA) 13 Diseases of the musculoskeletal system and con... NaN 1.6 NaN NaN 4.3 2.4 2.9 ... NaN NaN NaN NaN NaN NaN NaN NaN 3.2 NaN
1789 Virgin Islands (USA) 14 Diseases of the genitourinary system NaN 8.3 NaN NaN 7.7 7.1 13.8 ... NaN NaN NaN NaN NaN NaN NaN NaN 13.9 NaN
1790 Virgin Islands (USA) 15 Pregnancy, childbirth and the puerperium NaN 0.0 NaN NaN 0.0 0.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 14.7 NaN
1791 Virgin Islands (USA) 16 Congenital malformations, deformations and chr... NaN 0.8 NaN NaN 2.4 3.5 4.1 ... NaN NaN NaN NaN NaN NaN NaN NaN 8.7 NaN

1792 rows × 41 columns

Reverse the year numbers for easier graph plotting.

In [5]:
mdf = mdf[
    ["Country", "Cause", "CauseName"] + list(mdf.columns[:2:-1])
]  # I don't get why this is [:0:-1] instead of [1::-1] but ok.
mdf
Out[5]:
Country Cause CauseName 1979 1980 1981 1982 1983 1984 1985 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 Albania 1 Certain infectious and parasitic diseases NaN NaN NaN NaN NaN NaN NaN ... 2.0 2.0 1.1 1.2 NaN NaN NaN NaN NaN NaN
1 Albania 2 Neoplasms NaN NaN NaN NaN NaN NaN NaN ... 74.2 80.5 74.7 56.4 NaN NaN NaN NaN NaN NaN
2 Albania 3 Diseases of the blood and blood-forming organs... NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Albania 4 Endocrine, nutritional and metabolic diseases NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Albania 5 Mental and behavioural disorders NaN NaN NaN NaN NaN NaN NaN ... 1.8 1.6 1.5 1.7 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1787 Virgin Islands (USA) 12 Diseases of the skin and subcutaneous tissue NaN 3.0 NaN NaN NaN NaN NaN ... 2.5 1.4 0.7 4.1 1.8 0.6 NaN NaN 1.0 NaN
1788 Virgin Islands (USA) 13 Diseases of the musculoskeletal system and con... NaN 3.2 NaN NaN NaN NaN NaN ... 3.6 3.0 1.7 2.9 2.4 4.3 NaN NaN 1.6 NaN
1789 Virgin Islands (USA) 14 Diseases of the genitourinary system NaN 13.9 NaN NaN NaN NaN NaN ... 8.6 6.8 7.5 13.8 7.1 7.7 NaN NaN 8.3 NaN
1790 Virgin Islands (USA) 15 Pregnancy, childbirth and the puerperium NaN 14.7 NaN NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 NaN
1791 Virgin Islands (USA) 16 Congenital malformations, deformations and chr... NaN 8.7 NaN NaN NaN NaN NaN ... 7.1 4.0 1.1 4.1 3.5 2.4 NaN NaN 0.8 NaN

1792 rows × 41 columns

Let us assign ISO codes for each country and remove the locations that are not countries. We will also remove the rows with not a single data point. I chose to use get instead of search_fuzzy because the latter is way too slow.

I will also map the codes back onto the name to remove any inconsistencies as pycountry is not a 1-to-1 mapping.

In [6]:
import pycountry


def get_country_code(name):
    try:
        return pycountry.countries.lookup(name).alpha_3
    except:
        return None


def get_country_name(code):
    try:
        return pycountry.countries.get(alpha_3=code).name
    except:
        return None


mdf.insert(0, "CountryCode", mdf.Country.apply(get_country_code))
mdf2 = mdf.copy()

mdf.drop(mdf[mdf.CountryCode.isna()].index, inplace=True)
mdf.dropna(thresh=5, inplace=True)
mdf.reset_index(inplace=True, drop=True)
mdf[mdf.CountryCode == "MDK"]
Out[6]:
CountryCode Country Cause CauseName 1979 1980 1981 1982 1983 1984 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016

0 rows × 42 columns

We have to check that the countries left out are actually not countries, not just incorrectly named. If they are actually countries, then we should rename them and concatenate them back into mdf.

In [7]:
leftOut = mdf2[mdf2.CountryCode.isna()]
display(leftOut.Country.value_counts())
countryMapping = {
    "Hong Kong SAR": "Hong Kong",
    "Macau": "Macao",
    "Reunion": "Réunion",
    "Saint Vincent and Grenadines": "Saint Vincent and the Grenadines",
    "Virgin Islands (USA)": "Virgin Islands, U.S.",
    "TFYR Macedonia": "North Macedonia",
    "Iran (Islamic Rep of)": "Iran, Islamic Republic of",
    "Republic of Korea": "Korea, Republic of",
    "Venezuela (Bolivarian Republic of)": "Venezuela, Bolivarian Republic of",
}
leftOut.replace(countryMapping, inplace=True)
leftOut.drop("CountryCode", inplace=True, axis=1)
leftOut.insert(0, "CountryCode", leftOut.Country.apply(get_country_code))
print("number of nan: " + str(leftOut.CountryCode.isna().sum()))

mdf = pd.concat([mdf, leftOut], ignore_index=True)
mdf
Iran (Islamic Rep of)                 16
TFYR Macedonia                        16
Republic of Korea                     16
Venezuela (Bolivarian Republic of)    16
Hong Kong SAR                         16
Macau                                 16
Saint Vincent and Grenadines          16
Reunion                               16
Virgin Islands (USA)                  16
Name: Country, dtype: int64
number of nan: 0
Out[7]:
CountryCode Country Cause CauseName 1979 1980 1981 1982 1983 1984 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 ALB Albania 1 Certain infectious and parasitic diseases NaN NaN NaN NaN NaN NaN ... 2.0 2.0 1.1 1.2 NaN NaN NaN NaN NaN NaN
1 ALB Albania 2 Neoplasms NaN NaN NaN NaN NaN NaN ... 74.2 80.5 74.7 56.4 NaN NaN NaN NaN NaN NaN
2 ALB Albania 5 Mental and behavioural disorders NaN NaN NaN NaN NaN NaN ... 1.8 1.6 1.5 1.7 NaN NaN NaN NaN NaN NaN
3 ALB Albania 6 Diseases of the nervous system NaN NaN NaN NaN NaN NaN ... 12.7 4.8 4.8 7.3 NaN NaN NaN NaN NaN NaN
4 ALB Albania 7 Diseases of the eye and adnexa NaN NaN NaN NaN NaN NaN ... 0.1 0.1 0.1 0.1 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1776 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue NaN 3.0 NaN NaN NaN NaN ... 2.5 1.4 0.7 4.1 1.8 0.6 NaN NaN 1.0 NaN
1777 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... NaN 3.2 NaN NaN NaN NaN ... 3.6 3.0 1.7 2.9 2.4 4.3 NaN NaN 1.6 NaN
1778 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system NaN 13.9 NaN NaN NaN NaN ... 8.6 6.8 7.5 13.8 7.1 7.7 NaN NaN 8.3 NaN
1779 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium NaN 14.7 NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 NaN
1780 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... NaN 8.7 NaN NaN NaN NaN ... 7.1 4.0 1.1 4.1 3.5 2.4 NaN NaN 0.8 NaN

1781 rows × 42 columns

In [8]:
mdf.Country = mdf.CountryCode.apply(get_country_name)
mdf.CountryCode.isna().sum()
Out[8]:
0
In [9]:
bmidf = pd.read_csv("bmi.csv")
bmidf.drop([0, 1], inplace=True)
bmidf.rename({"Unnamed: 0": "Country"}, axis=1, inplace=True)
bmidf.reset_index(inplace=True, drop=True)
criteria = (bmidf.loc[0] == " Both sexes") | (bmidf.loc[0] == "Country")
bmidf = bmidf[
    criteria.index[criteria]
]  # This is a really good way to boolean index columns
bmidf
Out[9]:
Country 2016 2015 2014 2013 2012 2011 2010 2009 2008 ... 1984 1983 1982 1981 1980 1979 1978 1977 1976 1975
0 Country Both sexes Both sexes Both sexes Both sexes Both sexes Both sexes Both sexes Both sexes Both sexes ... Both sexes Both sexes Both sexes Both sexes Both sexes Both sexes Both sexes Both sexes Both sexes Both sexes
1 Afghanistan 23.4 [22.0-24.8] 23.3 [21.9-24.6] 23.2 [21.8-24.5] 23.0 [21.7-24.4] 22.9 [21.6-24.3] 22.8 [21.6-24.1] 22.7 [21.5-24.0] 22.6 [21.4-23.9] 22.5 [21.3-23.8] ... 20.0 [18.3-21.6] 19.9 [18.2-21.6] 19.8 [18.1-21.5] 19.6 [17.9-21.4] 19.5 [17.7-21.3] 19.4 [17.6-21.3] 19.3 [17.4-21.2] 19.2 [17.2-21.1] 19.0 [17.0-21.0] 18.9 [16.9-21.0]
2 Albania 26.7 [25.8-27.5] 26.6 [25.8-27.4] 26.5 [25.8-27.2] 26.4 [25.7-27.1] 26.3 [25.6-26.9] 26.2 [25.6-26.8] 26.1 [25.5-26.7] 26.0 [25.5-26.5] 25.9 [25.4-26.4] ... 24.3 [23.1-25.5] 24.2 [23.0-25.5] 24.2 [22.9-25.5] 24.1 [22.8-25.5] 24.1 [22.6-25.5] 24.0 [22.5-25.5] 23.9 [22.4-25.5] 23.9 [22.3-25.5] 23.8 [22.1-25.5] 23.8 [22.0-25.6]
3 Algeria 25.5 [24.5-26.5] 25.5 [24.5-26.4] 25.4 [24.5-26.2] 25.3 [24.5-26.1] 25.2 [24.5-26.0] 25.1 [24.4-25.8] 25.1 [24.4-25.7] 25.0 [24.4-25.6] 24.9 [24.3-25.5] ... 22.7 [21.5-23.9] 22.6 [21.4-23.9] 22.5 [21.2-23.8] 22.4 [21.1-23.8] 22.3 [20.9-23.8] 22.2 [20.8-23.7] 22.1 [20.6-23.7] 22.0 [20.4-23.7] 22.0 [20.3-23.7] 21.9 [20.1-23.7]
4 Andorra 26.7 [24.6-28.7] 26.7 [24.7-28.7] 26.7 [24.7-28.7] 26.8 [24.8-28.7] 26.8 [24.8-28.7] 26.8 [24.9-28.7] 26.8 [24.9-28.7] 26.8 [25.0-28.7] 26.8 [25.0-28.7] ... 26.0 [24.1-28.0] 26.0 [24.0-27.9] 25.9 [23.9-27.9] 25.9 [23.8-27.9] 25.8 [23.7-27.8] 25.7 [23.6-27.8] 25.6 [23.5-27.7] 25.6 [23.4-27.7] 25.5 [23.3-27.6] 25.4 [23.1-27.6]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
191 Venezuela (Bolivarian Republic of) 26.7 [26.2-27.2] 26.7 [26.2-27.2] 26.6 [26.2-27.1] 26.6 [26.2-27.1] 26.6 [26.2-27.1] 26.6 [26.1-27.0] 26.6 [26.1-27.0] 26.6 [26.1-27.0] 26.5 [26.1-27.0] ... 24.9 [23.7-26.1] 24.8 [23.6-26.0] 24.7 [23.4-26.0] 24.6 [23.3-26.0] 24.5 [23.1-25.9] 24.4 [22.9-25.9] 24.3 [22.8-25.8] 24.2 [22.6-25.8] 24.0 [22.4-25.8] 23.9 [22.2-25.7]
192 Viet Nam 21.9 [21.5-22.3] 21.7 [21.4-22.1] 21.6 [21.3-21.9] 21.5 [21.2-21.7] 21.3 [21.1-21.6] 21.2 [20.9-21.4] 21.0 [20.8-21.3] 20.9 [20.7-21.1] 20.8 [20.6-21.0] ... 18.8 [18.3-19.3] 18.7 [18.2-19.2] 18.6 [18.0-19.2] 18.6 [17.9-19.2] 18.5 [17.8-19.2] 18.4 [17.7-19.2] 18.4 [17.5-19.2] 18.3 [17.4-19.2] 18.2 [17.2-19.2] 18.2 [17.1-19.2]
193 Yemen 23.8 [23.1-24.5] 23.7 [23.1-24.3] 23.6 [23.1-24.2] 23.5 [23.1-24.0] 23.4 [23.0-23.9] 23.4 [23.0-23.8] 23.3 [22.9-23.6] 23.2 [22.8-23.5] 23.1 [22.8-23.4] ... 20.6 [19.6-21.7] 20.5 [19.4-21.7] 20.4 [19.2-21.6] 20.3 [19.1-21.6] 20.2 [18.9-21.5] 20.1 [18.7-21.5] 20.0 [18.6-21.5] 19.9 [18.4-21.4] 19.8 [18.2-21.4] 19.7 [18.0-21.4]
194 Zambia 22.6 [21.7-23.4] 22.5 [21.6-23.3] 22.4 [21.6-23.2] 22.4 [21.6-23.1] 22.3 [21.6-23.0] 22.2 [21.6-22.9] 22.2 [21.5-22.8] 22.1 [21.5-22.7] 22.0 [21.4-22.6] ... 20.5 [19.6-21.5] 20.5 [19.5-21.5] 20.4 [19.3-21.5] 20.2 [19.1-21.4] 20.1 [19.0-21.3] 20.0 [18.8-21.3] 19.9 [18.6-21.2] 19.8 [18.4-21.2] 19.6 [18.2-21.1] 19.5 [18.0-21.1]
195 Zimbabwe 23.8 [23.3-24.3] 23.8 [23.4-24.2] 23.8 [23.4-24.2] 23.7 [23.4-24.1] 23.7 [23.4-24.0] 23.7 [23.3-24.0] 23.6 [23.3-24.0] 23.6 [23.3-23.9] 23.6 [23.3-23.9] ... 22.6 [21.8-23.4] 22.5 [21.6-23.4] 22.4 [21.5-23.4] 22.4 [21.4-23.4] 22.3 [21.3-23.4] 22.3 [21.1-23.4] 22.2 [21.0-23.4] 22.1 [20.9-23.4] 22.1 [20.7-23.5] 22.0 [20.6-23.5]

196 rows × 43 columns

As one can see, bmidf only contains the BMI values for both sexes, so we can remove that row. We can also remove the uncertainty bounds, keeping only the first number which indicates the mean BMI.

In [10]:
bmidf.drop(0, inplace=True)
bmidf.reset_index(inplace=True, drop=True)
bmidf
Out[10]:
Country 2016 2015 2014 2013 2012 2011 2010 2009 2008 ... 1984 1983 1982 1981 1980 1979 1978 1977 1976 1975
0 Afghanistan 23.4 [22.0-24.8] 23.3 [21.9-24.6] 23.2 [21.8-24.5] 23.0 [21.7-24.4] 22.9 [21.6-24.3] 22.8 [21.6-24.1] 22.7 [21.5-24.0] 22.6 [21.4-23.9] 22.5 [21.3-23.8] ... 20.0 [18.3-21.6] 19.9 [18.2-21.6] 19.8 [18.1-21.5] 19.6 [17.9-21.4] 19.5 [17.7-21.3] 19.4 [17.6-21.3] 19.3 [17.4-21.2] 19.2 [17.2-21.1] 19.0 [17.0-21.0] 18.9 [16.9-21.0]
1 Albania 26.7 [25.8-27.5] 26.6 [25.8-27.4] 26.5 [25.8-27.2] 26.4 [25.7-27.1] 26.3 [25.6-26.9] 26.2 [25.6-26.8] 26.1 [25.5-26.7] 26.0 [25.5-26.5] 25.9 [25.4-26.4] ... 24.3 [23.1-25.5] 24.2 [23.0-25.5] 24.2 [22.9-25.5] 24.1 [22.8-25.5] 24.1 [22.6-25.5] 24.0 [22.5-25.5] 23.9 [22.4-25.5] 23.9 [22.3-25.5] 23.8 [22.1-25.5] 23.8 [22.0-25.6]
2 Algeria 25.5 [24.5-26.5] 25.5 [24.5-26.4] 25.4 [24.5-26.2] 25.3 [24.5-26.1] 25.2 [24.5-26.0] 25.1 [24.4-25.8] 25.1 [24.4-25.7] 25.0 [24.4-25.6] 24.9 [24.3-25.5] ... 22.7 [21.5-23.9] 22.6 [21.4-23.9] 22.5 [21.2-23.8] 22.4 [21.1-23.8] 22.3 [20.9-23.8] 22.2 [20.8-23.7] 22.1 [20.6-23.7] 22.0 [20.4-23.7] 22.0 [20.3-23.7] 21.9 [20.1-23.7]
3 Andorra 26.7 [24.6-28.7] 26.7 [24.7-28.7] 26.7 [24.7-28.7] 26.8 [24.8-28.7] 26.8 [24.8-28.7] 26.8 [24.9-28.7] 26.8 [24.9-28.7] 26.8 [25.0-28.7] 26.8 [25.0-28.7] ... 26.0 [24.1-28.0] 26.0 [24.0-27.9] 25.9 [23.9-27.9] 25.9 [23.8-27.9] 25.8 [23.7-27.8] 25.7 [23.6-27.8] 25.6 [23.5-27.7] 25.6 [23.4-27.7] 25.5 [23.3-27.6] 25.4 [23.1-27.6]
4 Angola 23.3 [21.2-25.6] 23.2 [21.1-25.4] 23.2 [21.1-25.3] 23.1 [21.0-25.2] 23.0 [21.0-25.0] 22.9 [20.9-24.9] 22.8 [20.8-24.8] 22.7 [20.7-24.7] 22.6 [20.6-24.5] ... 19.9 [17.8-21.9] 19.8 [17.6-21.8] 19.6 [17.5-21.7] 19.5 [17.3-21.7] 19.4 [17.1-21.6] 19.3 [17.0-21.5] 19.2 [16.8-21.4] 19.0 [16.7-21.4] 18.9 [16.5-21.3] 18.8 [16.3-21.2]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190 Venezuela (Bolivarian Republic of) 26.7 [26.2-27.2] 26.7 [26.2-27.2] 26.6 [26.2-27.1] 26.6 [26.2-27.1] 26.6 [26.2-27.1] 26.6 [26.1-27.0] 26.6 [26.1-27.0] 26.6 [26.1-27.0] 26.5 [26.1-27.0] ... 24.9 [23.7-26.1] 24.8 [23.6-26.0] 24.7 [23.4-26.0] 24.6 [23.3-26.0] 24.5 [23.1-25.9] 24.4 [22.9-25.9] 24.3 [22.8-25.8] 24.2 [22.6-25.8] 24.0 [22.4-25.8] 23.9 [22.2-25.7]
191 Viet Nam 21.9 [21.5-22.3] 21.7 [21.4-22.1] 21.6 [21.3-21.9] 21.5 [21.2-21.7] 21.3 [21.1-21.6] 21.2 [20.9-21.4] 21.0 [20.8-21.3] 20.9 [20.7-21.1] 20.8 [20.6-21.0] ... 18.8 [18.3-19.3] 18.7 [18.2-19.2] 18.6 [18.0-19.2] 18.6 [17.9-19.2] 18.5 [17.8-19.2] 18.4 [17.7-19.2] 18.4 [17.5-19.2] 18.3 [17.4-19.2] 18.2 [17.2-19.2] 18.2 [17.1-19.2]
192 Yemen 23.8 [23.1-24.5] 23.7 [23.1-24.3] 23.6 [23.1-24.2] 23.5 [23.1-24.0] 23.4 [23.0-23.9] 23.4 [23.0-23.8] 23.3 [22.9-23.6] 23.2 [22.8-23.5] 23.1 [22.8-23.4] ... 20.6 [19.6-21.7] 20.5 [19.4-21.7] 20.4 [19.2-21.6] 20.3 [19.1-21.6] 20.2 [18.9-21.5] 20.1 [18.7-21.5] 20.0 [18.6-21.5] 19.9 [18.4-21.4] 19.8 [18.2-21.4] 19.7 [18.0-21.4]
193 Zambia 22.6 [21.7-23.4] 22.5 [21.6-23.3] 22.4 [21.6-23.2] 22.4 [21.6-23.1] 22.3 [21.6-23.0] 22.2 [21.6-22.9] 22.2 [21.5-22.8] 22.1 [21.5-22.7] 22.0 [21.4-22.6] ... 20.5 [19.6-21.5] 20.5 [19.5-21.5] 20.4 [19.3-21.5] 20.2 [19.1-21.4] 20.1 [19.0-21.3] 20.0 [18.8-21.3] 19.9 [18.6-21.2] 19.8 [18.4-21.2] 19.6 [18.2-21.1] 19.5 [18.0-21.1]
194 Zimbabwe 23.8 [23.3-24.3] 23.8 [23.4-24.2] 23.8 [23.4-24.2] 23.7 [23.4-24.1] 23.7 [23.4-24.0] 23.7 [23.3-24.0] 23.6 [23.3-24.0] 23.6 [23.3-23.9] 23.6 [23.3-23.9] ... 22.6 [21.8-23.4] 22.5 [21.6-23.4] 22.4 [21.5-23.4] 22.4 [21.4-23.4] 22.3 [21.3-23.4] 22.3 [21.1-23.4] 22.2 [21.0-23.4] 22.1 [20.9-23.4] 22.1 [20.7-23.5] 22.0 [20.6-23.5]

195 rows × 43 columns

In [11]:
bmicountries = bmidf.loc[:, "Country"]


def check_float(f):
    try:
        float(f)
        return True
    except ValueError:
        return False


bmidf = bmidf.loc[:, "2016":].applymap(
    lambda x: float(x.split(" ")[0]) if check_float(x.split(" ")[0]) else None
)

bmidf.insert(0, "Country", bmicountries)
bmidf
Out[11]:
Country 2016 2015 2014 2013 2012 2011 2010 2009 2008 ... 1984 1983 1982 1981 1980 1979 1978 1977 1976 1975
0 Afghanistan 23.4 23.3 23.2 23.0 22.9 22.8 22.7 22.6 22.5 ... 20.0 19.9 19.8 19.6 19.5 19.4 19.3 19.2 19.0 18.9
1 Albania 26.7 26.6 26.5 26.4 26.3 26.2 26.1 26.0 25.9 ... 24.3 24.2 24.2 24.1 24.1 24.0 23.9 23.9 23.8 23.8
2 Algeria 25.5 25.5 25.4 25.3 25.2 25.1 25.1 25.0 24.9 ... 22.7 22.6 22.5 22.4 22.3 22.2 22.1 22.0 22.0 21.9
3 Andorra 26.7 26.7 26.7 26.8 26.8 26.8 26.8 26.8 26.8 ... 26.0 26.0 25.9 25.9 25.8 25.7 25.6 25.6 25.5 25.4
4 Angola 23.3 23.2 23.2 23.1 23.0 22.9 22.8 22.7 22.6 ... 19.9 19.8 19.6 19.5 19.4 19.3 19.2 19.0 18.9 18.8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190 Venezuela (Bolivarian Republic of) 26.7 26.7 26.6 26.6 26.6 26.6 26.6 26.6 26.5 ... 24.9 24.8 24.7 24.6 24.5 24.4 24.3 24.2 24.0 23.9
191 Viet Nam 21.9 21.7 21.6 21.5 21.3 21.2 21.0 20.9 20.8 ... 18.8 18.7 18.6 18.6 18.5 18.4 18.4 18.3 18.2 18.2
192 Yemen 23.8 23.7 23.6 23.5 23.4 23.4 23.3 23.2 23.1 ... 20.6 20.5 20.4 20.3 20.2 20.1 20.0 19.9 19.8 19.7
193 Zambia 22.6 22.5 22.4 22.4 22.3 22.2 22.2 22.1 22.0 ... 20.5 20.5 20.4 20.2 20.1 20.0 19.9 19.8 19.6 19.5
194 Zimbabwe 23.8 23.8 23.8 23.7 23.7 23.7 23.6 23.6 23.6 ... 22.6 22.5 22.4 22.4 22.3 22.3 22.2 22.1 22.1 22.0

195 rows × 43 columns

Similar to mortality data, BMI data also has some country names inconsistent with the ISO standard. We should be able to fix it using the same map as the one before.

In [12]:
bmidf.replace(countryMapping, inplace=True)
bmidf.insert(0, "CountryCode", bmidf.Country.apply(get_country_code))
bmidf.dropna(inplace=True)
bmidf.Country = bmidf.CountryCode.apply(get_country_name)
display(bmidf.CountryCode.isna().sum())
display(bmidf)
0
CountryCode Country 2016 2015 2014 2013 2012 2011 2010 2009 ... 1984 1983 1982 1981 1980 1979 1978 1977 1976 1975
0 AFG Afghanistan 23.4 23.3 23.2 23.0 22.9 22.8 22.7 22.6 ... 20.0 19.9 19.8 19.6 19.5 19.4 19.3 19.2 19.0 18.9
1 ALB Albania 26.7 26.6 26.5 26.4 26.3 26.2 26.1 26.0 ... 24.3 24.2 24.2 24.1 24.1 24.0 23.9 23.9 23.8 23.8
2 DZA Algeria 25.5 25.5 25.4 25.3 25.2 25.1 25.1 25.0 ... 22.7 22.6 22.5 22.4 22.3 22.2 22.1 22.0 22.0 21.9
3 AND Andorra 26.7 26.7 26.7 26.8 26.8 26.8 26.8 26.8 ... 26.0 26.0 25.9 25.9 25.8 25.7 25.6 25.6 25.5 25.4
4 AGO Angola 23.3 23.2 23.2 23.1 23.0 22.9 22.8 22.7 ... 19.9 19.8 19.6 19.5 19.4 19.3 19.2 19.0 18.9 18.8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190 VEN Venezuela, Bolivarian Republic of 26.7 26.7 26.6 26.6 26.6 26.6 26.6 26.6 ... 24.9 24.8 24.7 24.6 24.5 24.4 24.3 24.2 24.0 23.9
191 VNM Viet Nam 21.9 21.7 21.6 21.5 21.3 21.2 21.0 20.9 ... 18.8 18.7 18.6 18.6 18.5 18.4 18.4 18.3 18.2 18.2
192 YEM Yemen 23.8 23.7 23.6 23.5 23.4 23.4 23.3 23.2 ... 20.6 20.5 20.4 20.3 20.2 20.1 20.0 19.9 19.8 19.7
193 ZMB Zambia 22.6 22.5 22.4 22.4 22.3 22.2 22.2 22.1 ... 20.5 20.5 20.4 20.2 20.1 20.0 19.9 19.8 19.6 19.5
194 ZWE Zimbabwe 23.8 23.8 23.8 23.7 23.7 23.7 23.6 23.6 ... 22.6 22.5 22.4 22.4 22.3 22.3 22.2 22.1 22.1 22.0

186 rows × 44 columns

Also reverse the year numbers to make plotting graphs easier.

In [13]:
bmidf = bmidf[["CountryCode", "Country"] + list(bmidf.columns[:1:-1])]
bmidf
Out[13]:
CountryCode Country 1975 1976 1977 1978 1979 1980 1981 1982 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 AFG Afghanistan 18.9 19.0 19.2 19.3 19.4 19.5 19.6 19.8 ... 22.4 22.5 22.6 22.7 22.8 22.9 23.0 23.2 23.3 23.4
1 ALB Albania 23.8 23.8 23.9 23.9 24.0 24.1 24.1 24.2 ... 25.8 25.9 26.0 26.1 26.2 26.3 26.4 26.5 26.6 26.7
2 DZA Algeria 21.9 22.0 22.0 22.1 22.2 22.3 22.4 22.5 ... 24.8 24.9 25.0 25.1 25.1 25.2 25.3 25.4 25.5 25.5
3 AND Andorra 25.4 25.5 25.6 25.6 25.7 25.8 25.9 25.9 ... 26.8 26.8 26.8 26.8 26.8 26.8 26.8 26.7 26.7 26.7
4 AGO Angola 18.8 18.9 19.0 19.2 19.3 19.4 19.5 19.6 ... 22.5 22.6 22.7 22.8 22.9 23.0 23.1 23.2 23.2 23.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190 VEN Venezuela, Bolivarian Republic of 23.9 24.0 24.2 24.3 24.4 24.5 24.6 24.7 ... 26.5 26.5 26.6 26.6 26.6 26.6 26.6 26.6 26.7 26.7
191 VNM Viet Nam 18.2 18.2 18.3 18.4 18.4 18.5 18.6 18.6 ... 20.7 20.8 20.9 21.0 21.2 21.3 21.5 21.6 21.7 21.9
192 YEM Yemen 19.7 19.8 19.9 20.0 20.1 20.2 20.3 20.4 ... 23.0 23.1 23.2 23.3 23.4 23.4 23.5 23.6 23.7 23.8
193 ZMB Zambia 19.5 19.6 19.8 19.9 20.0 20.1 20.2 20.4 ... 22.0 22.0 22.1 22.2 22.2 22.3 22.4 22.4 22.5 22.6
194 ZWE Zimbabwe 22.0 22.1 22.1 22.2 22.3 22.3 22.4 22.4 ... 23.5 23.6 23.6 23.6 23.7 23.7 23.7 23.8 23.8 23.8

186 rows × 44 columns

Remove data prior to 1979:

In [14]:
bmidf.drop([str(x) for x in range(1975, 1979)], axis=1, inplace=True)
bmidf.reset_index(inplace=True, drop=True)
bmidf
Out[14]:
CountryCode Country 1979 1980 1981 1982 1983 1984 1985 1986 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 AFG Afghanistan 19.4 19.5 19.6 19.8 19.9 20.0 20.1 20.2 ... 22.4 22.5 22.6 22.7 22.8 22.9 23.0 23.2 23.3 23.4
1 ALB Albania 24.0 24.1 24.1 24.2 24.2 24.3 24.3 24.4 ... 25.8 25.9 26.0 26.1 26.2 26.3 26.4 26.5 26.6 26.7
2 DZA Algeria 22.2 22.3 22.4 22.5 22.6 22.7 22.8 22.9 ... 24.8 24.9 25.0 25.1 25.1 25.2 25.3 25.4 25.5 25.5
3 AND Andorra 25.7 25.8 25.9 25.9 26.0 26.0 26.1 26.2 ... 26.8 26.8 26.8 26.8 26.8 26.8 26.8 26.7 26.7 26.7
4 AGO Angola 19.3 19.4 19.5 19.6 19.8 19.9 20.0 20.1 ... 22.5 22.6 22.7 22.8 22.9 23.0 23.1 23.2 23.2 23.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
181 VEN Venezuela, Bolivarian Republic of 24.4 24.5 24.6 24.7 24.8 24.9 25.0 25.1 ... 26.5 26.5 26.6 26.6 26.6 26.6 26.6 26.6 26.7 26.7
182 VNM Viet Nam 18.4 18.5 18.6 18.6 18.7 18.8 18.8 18.9 ... 20.7 20.8 20.9 21.0 21.2 21.3 21.5 21.6 21.7 21.9
183 YEM Yemen 20.1 20.2 20.3 20.4 20.5 20.6 20.7 20.8 ... 23.0 23.1 23.2 23.3 23.4 23.4 23.5 23.6 23.7 23.8
184 ZMB Zambia 20.0 20.1 20.2 20.4 20.5 20.5 20.6 20.7 ... 22.0 22.0 22.1 22.2 22.2 22.3 22.4 22.4 22.5 22.6
185 ZWE Zimbabwe 22.3 22.3 22.4 22.4 22.5 22.6 22.6 22.7 ... 23.5 23.6 23.6 23.6 23.7 23.7 23.7 23.8 23.8 23.8

186 rows × 40 columns

Check for missing data:

In [15]:
display(bmidf.dtypes.value_counts())
display(bmidf.isna().sum().sum())
float64    38
object      2
dtype: int64
0
In [16]:
cmrdf = pd.read_excel("cmr.xlsx", sheet_name=0, header=14)
cmrdf = cmrdf.loc[:584]
cmrdf
Out[16]:
ISO.Code Country.Name Uncertainty.Bounds* 1950.5 1951.5 1952.5 1953.5 1954.5 1955.5 1956.5 ... 2010.5 2011.5 2012.5 2013.5 2014.5 2015.5 2016.5 2017.5 2018.5 2019.5
0 AFG Afghanistan Lower NaN NaN NaN NaN NaN NaN NaN ... 59.418411 56.920935 54.351492 51.909067 49.342518 46.814268 44.356079 42.077671 39.944302 37.705916
1 AFG Afghanistan Median NaN NaN NaN NaN NaN NaN NaN ... 64.100567 61.726441 59.454642 57.225151 55.160932 53.169807 51.319141 49.611428 48.043335 46.512825
2 AFG Afghanistan Upper NaN NaN NaN NaN NaN NaN NaN ... 69.265736 67.089883 65.032220 63.113255 61.386848 59.787768 58.378906 57.226683 56.320057 55.818684
3 ALB Albania Lower NaN NaN NaN NaN NaN NaN NaN ... 11.453889 10.469362 9.625089 8.963378 8.495364 8.205437 8.071430 8.043587 8.064059 8.064743
4 ALB Albania Median NaN NaN NaN NaN NaN NaN NaN ... 11.828097 10.825056 9.963018 9.284420 8.805975 8.509460 8.378718 8.377771 8.473569 8.614279
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
580 ZMB Zambia Median NaN NaN NaN 138.522662 135.831489 133.228545 130.723865 ... 52.323193 51.028241 49.923447 48.746112 47.472249 46.721240 45.789160 44.124429 43.364637 42.415261
581 ZMB Zambia Upper NaN NaN NaN 183.501394 171.798482 162.555913 154.748751 ... 56.131591 54.938805 53.856176 52.769114 51.695715 51.369981 51.129050 50.416010 50.962951 51.185193
582 ZWE Zimbabwe Lower NaN NaN NaN NaN 79.678893 82.530744 84.233003 ... 50.099059 47.932403 44.146881 41.985616 40.191513 38.455971 36.468988 34.735030 32.471826 30.455060
583 ZWE Zimbabwe Median NaN NaN NaN NaN 104.305814 102.558033 100.831113 ... 53.776518 51.631998 47.853394 45.864949 44.245706 42.868009 41.392057 40.517294 39.273756 38.423776
584 ZWE Zimbabwe Upper NaN NaN NaN NaN 139.889589 129.061901 121.647595 ... 57.893104 55.636366 51.758305 49.938442 48.541491 47.714346 46.917564 47.059759 47.225792 47.928436

585 rows × 73 columns

Note that we only need the Median rows of CMR. I will also standardize the column names.

In [17]:
cmrdf = cmrdf[cmrdf["Uncertainty.Bounds*"] == "Median"]
cmrdf.reset_index(inplace=True, drop=True)
cmrdf.drop("Uncertainty.Bounds*", axis=1, inplace=True)
cmrdf.rename(
    {"ISO.Code": "CountryCode", "Country.Name": "Country"}, axis=1, inplace=True
)
cmrdf
Out[17]:
CountryCode Country 1950.5 1951.5 1952.5 1953.5 1954.5 1955.5 1956.5 1957.5 ... 2010.5 2011.5 2012.5 2013.5 2014.5 2015.5 2016.5 2017.5 2018.5 2019.5
0 AFG Afghanistan NaN NaN NaN NaN NaN NaN NaN NaN ... 64.100567 61.726441 59.454642 57.225151 55.160932 53.169807 51.319141 49.611428 48.043335 46.512825
1 ALB Albania NaN NaN NaN NaN NaN NaN NaN NaN ... 11.828097 10.825056 9.963018 9.284420 8.805975 8.509460 8.378718 8.377771 8.473569 8.614279
2 DZA Algeria NaN NaN NaN NaN 146.488512 146.230842 146.013478 145.781106 ... 23.562945 22.933794 22.471525 22.137349 21.873076 21.612694 21.280396 20.872521 20.429315 19.954867
3 AND Andorra NaN NaN NaN NaN NaN NaN NaN NaN ... 4.256321 4.043769 3.850386 3.665888 3.495399 3.340385 3.200425 3.064820 2.941386 2.831607
4 AGO Angola NaN NaN NaN NaN NaN NaN NaN NaN ... 75.868650 71.420473 67.316445 63.715246 60.540468 57.816653 55.562227 53.535175 51.871440 50.180155
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190 VEN Venezuela (Bolivarian Republic of) NaN 81.653269 78.394203 75.444734 72.594348 69.931813 67.321776 64.899605 ... 14.666083 14.743388 14.887073 15.119070 15.431784 16.436529 21.044404 21.038954 21.038436 21.038436
191 VNM Viet Nam NaN NaN NaN NaN NaN NaN NaN NaN ... 18.261223 18.038662 17.818400 17.617092 17.417050 17.201177 16.937983 16.628510 16.269846 15.884507
192 YEM Yemen NaN NaN NaN NaN NaN NaN NaN NaN ... 43.634253 43.077472 42.964150 42.962961 42.986121 43.358838 43.310794 43.341707 43.639697 43.623031
193 ZMB Zambia NaN NaN NaN 138.522662 135.831489 133.228545 130.723865 128.427292 ... 52.323193 51.028241 49.923447 48.746112 47.472249 46.721240 45.789160 44.124429 43.364637 42.415261
194 ZWE Zimbabwe NaN NaN NaN NaN 104.305814 102.558033 100.831113 99.052944 ... 53.776518 51.631998 47.853394 45.864949 44.245706 42.868009 41.392057 40.517294 39.273756 38.423776

195 rows × 72 columns

Let us standardize the year naming scheme.

In [18]:
cmrdf.columns = cmrdf.columns.str.replace("\.5", "")
cmrdf
Out[18]:
CountryCode Country 1950 1951 1952 1953 1954 1955 1956 1957 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 AFG Afghanistan NaN NaN NaN NaN NaN NaN NaN NaN ... 64.100567 61.726441 59.454642 57.225151 55.160932 53.169807 51.319141 49.611428 48.043335 46.512825
1 ALB Albania NaN NaN NaN NaN NaN NaN NaN NaN ... 11.828097 10.825056 9.963018 9.284420 8.805975 8.509460 8.378718 8.377771 8.473569 8.614279
2 DZA Algeria NaN NaN NaN NaN 146.488512 146.230842 146.013478 145.781106 ... 23.562945 22.933794 22.471525 22.137349 21.873076 21.612694 21.280396 20.872521 20.429315 19.954867
3 AND Andorra NaN NaN NaN NaN NaN NaN NaN NaN ... 4.256321 4.043769 3.850386 3.665888 3.495399 3.340385 3.200425 3.064820 2.941386 2.831607
4 AGO Angola NaN NaN NaN NaN NaN NaN NaN NaN ... 75.868650 71.420473 67.316445 63.715246 60.540468 57.816653 55.562227 53.535175 51.871440 50.180155
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190 VEN Venezuela (Bolivarian Republic of) NaN 81.653269 78.394203 75.444734 72.594348 69.931813 67.321776 64.899605 ... 14.666083 14.743388 14.887073 15.119070 15.431784 16.436529 21.044404 21.038954 21.038436 21.038436
191 VNM Viet Nam NaN NaN NaN NaN NaN NaN NaN NaN ... 18.261223 18.038662 17.818400 17.617092 17.417050 17.201177 16.937983 16.628510 16.269846 15.884507
192 YEM Yemen NaN NaN NaN NaN NaN NaN NaN NaN ... 43.634253 43.077472 42.964150 42.962961 42.986121 43.358838 43.310794 43.341707 43.639697 43.623031
193 ZMB Zambia NaN NaN NaN 138.522662 135.831489 133.228545 130.723865 128.427292 ... 52.323193 51.028241 49.923447 48.746112 47.472249 46.721240 45.789160 44.124429 43.364637 42.415261
194 ZWE Zimbabwe NaN NaN NaN NaN 104.305814 102.558033 100.831113 99.052944 ... 53.776518 51.631998 47.853394 45.864949 44.245706 42.868009 41.392057 40.517294 39.273756 38.423776

195 rows × 72 columns

In [19]:
display(
    cmrdf.isna()
    .sum()
    .plot(kind="bar", figsize=(15, 10), xlabel="Column", ylabel="Number of NaN")
)
<matplotlib.axes._subplots.AxesSubplot at 0x2354211a358>

The NaN values seem to be concentrated towards the datapoints taken earlier, mostly before the 1980s. Data before 1979 is useless anyways, so there shouldn't be too much impact leaving a few nans in.

Let us remove the data prior to 1979 and after 2016.

In [20]:
cmrdf.drop(
    [str(x) for x in list(range(1950, 1979)) + list(range(2017, 2020))],
    axis=1,
    inplace=True,
)
cmrdf
Out[20]:
CountryCode Country 1979 1980 1981 1982 1983 1984 1985 1986 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 AFG Afghanistan 165.257379 161.028132 156.763436 152.481639 148.354144 144.321605 140.257604 136.247094 ... 71.660270 69.035406 66.525823 64.100567 61.726441 59.454642 57.225151 55.160932 53.169807 51.319141
1 ALB Albania 71.181438 66.155752 61.475555 57.115470 53.214275 49.681820 46.508865 43.680472 ... 15.292477 14.077470 12.914518 11.828097 10.825056 9.963018 9.284420 8.805975 8.509460 8.378718
2 DZA Algeria 108.045384 101.794747 94.546163 86.274384 77.023313 67.696731 59.545794 53.206759 ... 26.437225 25.349898 24.381526 23.562945 22.933794 22.471525 22.137349 21.873076 21.612694 21.280396
3 AND Andorra NaN NaN NaN NaN NaN NaN 10.106871 9.893521 ... 4.916380 4.680900 4.459207 4.256321 4.043769 3.850386 3.665888 3.495399 3.340385 3.200425
4 AGO Angola NaN 140.238393 138.636313 137.023220 135.410888 134.155570 133.220981 132.510430 ... 90.490680 85.345245 80.484163 75.868650 71.420473 67.316445 63.715246 60.540468 57.816653 55.562227
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190 VEN Venezuela (Bolivarian Republic of) 36.571065 35.186548 33.864139 32.649864 31.545204 30.547752 29.618986 28.669964 ... 14.982381 14.745607 14.653889 14.666083 14.743388 14.887073 15.119070 15.431784 16.436529 21.044404
191 VNM Viet Nam 47.531900 46.631146 45.737060 44.938104 44.172720 43.422350 42.616956 41.720032 ... 19.067110 18.766418 18.495580 18.261223 18.038662 17.818400 17.617092 17.417050 17.201177 16.937983
192 YEM Yemen 147.133742 139.228105 131.385962 123.940729 117.077235 111.019952 105.715362 101.186844 ... 49.343347 46.984256 44.940469 43.634253 43.077472 42.964150 42.962961 42.986121 43.358838 43.310794
193 ZMB Zambia 95.096937 95.596715 96.032758 96.717562 97.831035 99.456249 101.373502 103.377671 ... 58.374573 55.919851 54.130195 52.323193 51.028241 49.923447 48.746112 47.472249 46.721240 45.789160
194 ZWE Zimbabwe 70.143121 68.574404 66.204777 63.317586 60.084513 56.855553 53.973998 51.651692 ... 56.202805 56.418056 55.866004 53.776518 51.631998 47.853394 45.864949 44.245706 42.868009 41.392057

195 rows × 40 columns

cmrdf suffers from the same problem of iso-uncompliant country names, so let's fix that right now.

In [21]:
cmrdf.Country = cmrdf.CountryCode.apply(get_country_name)
display(cmrdf)
CountryCode Country 1979 1980 1981 1982 1983 1984 1985 1986 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 AFG Afghanistan 165.257379 161.028132 156.763436 152.481639 148.354144 144.321605 140.257604 136.247094 ... 71.660270 69.035406 66.525823 64.100567 61.726441 59.454642 57.225151 55.160932 53.169807 51.319141
1 ALB Albania 71.181438 66.155752 61.475555 57.115470 53.214275 49.681820 46.508865 43.680472 ... 15.292477 14.077470 12.914518 11.828097 10.825056 9.963018 9.284420 8.805975 8.509460 8.378718
2 DZA Algeria 108.045384 101.794747 94.546163 86.274384 77.023313 67.696731 59.545794 53.206759 ... 26.437225 25.349898 24.381526 23.562945 22.933794 22.471525 22.137349 21.873076 21.612694 21.280396
3 AND Andorra NaN NaN NaN NaN NaN NaN 10.106871 9.893521 ... 4.916380 4.680900 4.459207 4.256321 4.043769 3.850386 3.665888 3.495399 3.340385 3.200425
4 AGO Angola NaN 140.238393 138.636313 137.023220 135.410888 134.155570 133.220981 132.510430 ... 90.490680 85.345245 80.484163 75.868650 71.420473 67.316445 63.715246 60.540468 57.816653 55.562227
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190 VEN Venezuela, Bolivarian Republic of 36.571065 35.186548 33.864139 32.649864 31.545204 30.547752 29.618986 28.669964 ... 14.982381 14.745607 14.653889 14.666083 14.743388 14.887073 15.119070 15.431784 16.436529 21.044404
191 VNM Viet Nam 47.531900 46.631146 45.737060 44.938104 44.172720 43.422350 42.616956 41.720032 ... 19.067110 18.766418 18.495580 18.261223 18.038662 17.818400 17.617092 17.417050 17.201177 16.937983
192 YEM Yemen 147.133742 139.228105 131.385962 123.940729 117.077235 111.019952 105.715362 101.186844 ... 49.343347 46.984256 44.940469 43.634253 43.077472 42.964150 42.962961 42.986121 43.358838 43.310794
193 ZMB Zambia 95.096937 95.596715 96.032758 96.717562 97.831035 99.456249 101.373502 103.377671 ... 58.374573 55.919851 54.130195 52.323193 51.028241 49.923447 48.746112 47.472249 46.721240 45.789160
194 ZWE Zimbabwe 70.143121 68.574404 66.204777 63.317586 60.084513 56.855553 53.973998 51.651692 ... 56.202805 56.418056 55.866004 53.776518 51.631998 47.853394 45.864949 44.245706 42.868009 41.392057

195 rows × 40 columns

In [22]:
df = pd.read_excel("hdi.xlsx", sheet_name=0)
with pd.option_context("display.max_rows", None):
    display(df)
Unnamed: 0 Table 1. Human Development Index and its components Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10 Unnamed: 11 Unnamed: 12 Unnamed: 13 Unnamed: 14
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN SDG3 NaN SDG4.3 NaN SDG4.6 NaN SDG8.5 NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN Human development index (HDI) NaN Life expectancy at birth NaN Expected years of schooling NaN Mean years of schooling NaN Gross national income (GNI) per capita NaN GNI per capita rank minus HDI rank NaN HDI rank
4 HDI rank Country (index value) NaN (years) NaN (years) NaN (years) NaN (2011 PPP $) NaN NaN NaN NaN
5 NaN NaN 2018 NaN 2018 NaN 2018 a 2018 a 2018 NaN 2018 NaN 2017
6 NaN VERY HIGH HUMAN DEVELOPMENT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 1 Norway 0.953688 NaN 82.271 NaN 18.0608 b 12.5668 NaN 68058.6 NaN 5 NaN 1
8 2 Switzerland 0.945936 NaN 83.63 NaN 16.2088 NaN 13.3808 NaN 59374.7 NaN 8 NaN 2
9 3 Ireland 0.942473 NaN 82.103 NaN 18.7933 b 12.5263 c 55659.7 NaN 9 NaN 3
10 4 Germany 0.938785 NaN 81.18 NaN 17.0964 NaN 14.1321 NaN 46945.9 NaN 15 NaN 4
11 4 Hong Kong, China (SAR) 0.938809 NaN 84.687 NaN 16.5122 NaN 12.0381 NaN 60220.8 NaN 5 NaN 6
12 6 Australia 0.938379 NaN 83.281 NaN 22.1037 b 12.683 c 44097 NaN 15 NaN 5
13 6 Iceland 0.938474 NaN 82.855 NaN 19.1745 b 12.5367 c 47566.5 NaN 12 NaN 7
14 8 Sweden 0.936628 NaN 82.654 NaN 18.8322 b 12.426 NaN 47955.4 NaN 9 NaN 7
15 9 Singapore 0.934819 NaN 83.458 NaN 16.3282 NaN 11.4965 NaN 83792.7 d -6 NaN 9
16 10 Netherlands 0.933495 NaN 82.143 NaN 18.0448 b 12.19 NaN 50012.6 NaN 3 NaN 10
17 11 Denmark 0.929946 NaN 80.784 NaN 19.0661 b 12.5878 NaN 48836.1 NaN 4 NaN 11
18 12 Finland 0.925192 NaN 81.736 NaN 19.3164 b 12.4396 NaN 41779.3 NaN 12 NaN 12
19 13 Canada 0.922111 NaN 82.315 NaN 16.0914 NaN 13.3154 c 43602.2 NaN 10 NaN 13
20 14 New Zealand 0.920886 NaN 82.145 NaN 18.8376 b 12.6782 c 35107.5 NaN 18 NaN 14
21 15 United Kingdom 0.920352 NaN 81.236 NaN 17.4429 NaN 12.9507 e 39507.3 NaN 13 NaN 15
22 15 United States 0.919926 NaN 78.851 NaN 16.2744 NaN 13.4134 NaN 56140.2 NaN -4 NaN 15
23 17 Belgium 0.918841 NaN 81.468 NaN 19.6999 b 11.7839 NaN 43820.8 NaN 5 NaN 17
24 18 Liechtenstein 0.916724 NaN 80.537 f 14.7209 NaN 12.5485 g 99732.1 d,h -16 NaN 18
25 19 Japan 0.914696 NaN 84.47 NaN 15.2306 NaN 12.8 i 40799 NaN 6 NaN 19
26 20 Austria 0.913809 NaN 81.434 NaN 16.2868 NaN 12.5585 NaN 46230.6 NaN 0 NaN 20
27 21 Luxembourg 0.90872 NaN 82.102 NaN 14.2275 NaN 12.1958 e 65543.1 NaN -13 NaN 21
28 22 Israel 0.906235 NaN 82.819 NaN 15.9912 NaN 12.9592 NaN 33649.7 NaN 13 NaN 22
29 22 Korea (Republic of) 0.905832 NaN 82.846 NaN 16.3938 NaN 12.185 NaN 36757 NaN 8 NaN 22
30 24 Slovenia 0.901601 NaN 81.172 NaN 17.4222 NaN 12.2734 NaN 32143 NaN 13 NaN 24
31 25 Spain 0.892788 NaN 83.433 NaN 17.877 NaN 9.81948 NaN 35041.3 NaN 8 NaN 25
32 26 Czechia 0.890755 NaN 79.222 NaN 16.8316 NaN 12.7404 NaN 31597.1 NaN 12 NaN 27
33 26 France 0.891057 NaN 82.541 NaN 15.4858 NaN 11.417 NaN 40510.8 NaN 0 NaN 26
34 28 Malta 0.885266 NaN 82.376 NaN 15.8966 NaN 11.2881 NaN 34795.2 NaN 6 NaN 28
35 29 Italy 0.882584 NaN 83.352 NaN 16.2469 NaN 10.2451 e 36141.4 NaN 2 NaN 29
36 30 Estonia 0.881542 NaN 78.567 NaN 16.0636 NaN 13.029 c 30378.6 NaN 10 NaN 30
37 31 Cyprus 0.872955 NaN 80.828 NaN 14.674 NaN 12.1039 NaN 33100.3 NaN 5 NaN 31
38 32 Greece 0.871952 NaN 82.072 NaN 17.3367 NaN 10.5396 NaN 24909.3 NaN 20 NaN 31
39 32 Poland 0.871768 NaN 78.544 NaN 16.4332 NaN 12.294 NaN 27625.8 NaN 13 NaN 33
40 34 Lithuania 0.869308 NaN 75.737 NaN 16.4995 NaN 12.961 NaN 29775.3 NaN 7 NaN 34
41 35 United Arab Emirates 0.866438 NaN 77.814 NaN 13.6434 NaN 10.9541 NaN 66911.7 NaN -28 NaN 35
42 36 Andorra 0.856781 NaN 81.786 f 13.3002 j 10.1555 NaN 48640.9 k -20 NaN 38
43 36 Saudi Arabia 0.856963 NaN 74.998 NaN 16.9791 e 9.67145 e 49338.4 NaN -22 NaN 36
44 36 Slovakia 0.856878 NaN 77.388 NaN 14.5331 NaN 12.6056 c 30671.9 NaN 3 NaN 37
45 39 Latvia 0.853902 NaN 75.172 NaN 15.9827 NaN 12.8254 c 26300.8 NaN 10 NaN 39
46 40 Portugal 0.850212 NaN 81.857 NaN 16.2965 NaN 9.19096 NaN 27935.4 NaN 4 NaN 40
47 41 Qatar 0.848441 NaN 80.1 NaN 12.1789 NaN 9.66734 NaN 110489 d -40 NaN 40
48 42 Chile 0.846948 NaN 80.042 NaN 16.5316 NaN 10.447 NaN 21972.3 NaN 17 NaN 42
49 43 Brunei Darussalam 0.844587 NaN 75.722 NaN 14.3801 NaN 9.1 i 76388.5 d -39 NaN 43
50 43 Hungary 0.8447 NaN 76.701 NaN 15.1171 NaN 11.8894 NaN 27144.2 NaN 4 NaN 44
51 45 Bahrain 0.83777 NaN 77.163 NaN 15.2571 NaN 9.41183 e 40399.1 NaN -18 NaN 45
52 46 Croatia 0.837313 NaN 78.342 NaN 14.9625 NaN 11.4052 e 23061 NaN 9 NaN 46
53 47 Oman 0.833789 NaN 77.633 NaN 14.6641 NaN 9.73185 NaN 37039.2 NaN -18 NaN 47
54 48 Argentina 0.830096 NaN 76.52 NaN 17.6384 NaN 10.565 c 17611.2 NaN 18 NaN 48
55 49 Russian Federation 0.824038 NaN 72.386 NaN 15.5357 NaN 12.02 e 25036 NaN 2 NaN 49
56 50 Belarus 0.817068 NaN 74.59 NaN 15.3589 NaN 12.3058 l 17038.5 NaN 18 NaN 50
57 50 Kazakhstan 0.817244 NaN 73.236 NaN 15.2652 NaN 11.784 i 22167.7 NaN 8 NaN 51
58 52 Bulgaria 0.815711 NaN 74.928 NaN 14.8105 NaN 11.8148 NaN 19645.9 NaN 9 NaN 51
59 52 Montenegro 0.815934 NaN 76.77 NaN 15.0319 NaN 11.3868 e 17510.7 NaN 15 NaN 51
60 52 Romania 0.81557 NaN 75.924 NaN 14.2635 NaN 10.9784 NaN 23905.8 NaN 2 NaN 51
61 55 Palau 0.814246 NaN 73.678 f 15.5532 e 12.4 e 16720 NaN 14 NaN 56
62 56 Barbados 0.81331 NaN 79.081 NaN 15.1557 e 10.5569 m 15912.3 NaN 18 NaN 51
63 57 Kuwait 0.808352 NaN 75.398 NaN 13.7588 NaN 7.27567 NaN 71164.2 NaN -52 NaN 57
64 57 Uruguay 0.807764 NaN 77.77 NaN 16.3448 NaN 8.72872 NaN 19434.9 NaN 5 NaN 58
65 59 Turkey 0.80649 NaN 77.437 NaN 16.4391 e 7.66791 NaN 24905.4 NaN -6 NaN 59
66 60 Bahamas 0.805483 NaN 73.752 NaN 12.8226 n 11.533 e 28395.4 NaN -17 NaN 60
67 61 Malaysia 0.804186 NaN 75.997 NaN 13.4688 NaN 10.16 NaN 27226.7 NaN -15 NaN 61
68 62 Seychelles 0.801395 NaN 73.333 NaN 15.4525 NaN 9.67281 j 25076.9 NaN -12 NaN 62
69 NaN HIGH HUMAN DEVELOPMENT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
70 63 Serbia 0.79926 NaN 75.849 NaN 14.7661 NaN 11.1807 NaN 15217.7 NaN 15 NaN 65
71 63 Trinidad and Tobago 0.799028 NaN 73.38 NaN 12.96 e 11.0259 l 28497.4 NaN -21 NaN 63
72 65 Iran (Islamic Republic of) 0.797483 NaN 76.479 NaN 14.7268 NaN 10.0115 NaN 18166.5 NaN -2 NaN 63
73 66 Mauritius 0.796406 NaN 74.863 NaN 14.9698 NaN 9.43 i 22724.2 NaN -10 NaN 66
74 67 Panama 0.795055 NaN 78.329 NaN 12.8806 NaN 10.17 i 20454.9 NaN -7 NaN 66
75 68 Costa Rica 0.79352 NaN 80.095 NaN 15.3771 NaN 8.66718 NaN 14789.9 NaN 12 NaN 68
76 69 Albania 0.791406 NaN 78.458 NaN 15.23 NaN 10.0546 m 12299.8 NaN 20 NaN 69
77 70 Georgia 0.786359 NaN 73.6 NaN 15.4331 NaN 12.8146 NaN 9569.52 NaN 34 NaN 70
78 71 Sri Lanka 0.780115 NaN 76.812 NaN 13.9653 NaN 11.0518 e 11610.9 NaN 24 NaN 72
79 72 Cuba 0.777715 NaN 78.726 NaN 14.3702 NaN 11.7508 e 7811.36 o 43 NaN 71
80 73 Saint Kitts and Nevis 0.776815 NaN 74.564 f 13.608 e 8.5 n 26770.1 NaN -25 NaN 73
81 74 Antigua and Barbuda 0.776163 NaN 76.885 NaN 12.4525 e 9.26309 j 22201.2 NaN -17 NaN 73
82 75 Bosnia and Herzegovina 0.769187 NaN 77.262 NaN 13.7876 j 9.69292 NaN 12689.7 NaN 10 NaN 75
83 76 Mexico 0.767447 NaN 74.992 NaN 14.2974 NaN 8.60085 NaN 17628.1 NaN -11 NaN 76
84 77 Thailand 0.764605 NaN 76.931 NaN 14.6503 e 7.7314 NaN 16128.5 NaN -6 NaN 77
85 78 Grenada 0.763383 NaN 72.384 NaN 16.6045 NaN 8.8 n 12683.8 NaN 8 NaN 78
86 79 Brazil 0.761153 NaN 75.672 NaN 15.3969 NaN 7.84105 e 14068 NaN 2 NaN 78
87 79 Colombia 0.760895 NaN 77.109 NaN 14.5981 NaN 8.32658 NaN 12895.6 NaN 4 NaN 78
88 81 Armenia 0.760011 NaN 74.945 NaN 13.1684 e 11.7939 NaN 9277.23 NaN 26 NaN 81
89 82 Algeria 0.758978 NaN 76.693 NaN 14.7194 e 7.98655 l 13639.4 NaN 0 NaN 81
90 82 North Macedonia 0.759422 NaN 75.688 NaN 13.464 NaN 9.68012 l 12873.8 NaN 2 NaN 81
91 82 Peru 0.759104 NaN 76.516 NaN 13.8474 NaN 9.2154 NaN 12322.7 NaN 6 NaN 85
92 85 China 0.757587 NaN 76.704 NaN 13.8889 e 7.9 m 16126.6 NaN -13 NaN 86
93 85 Ecuador 0.757916 NaN 76.8 NaN 14.923 e 8.98551 NaN 10141.1 NaN 17 NaN 84
94 87 Azerbaijan 0.753869 NaN 72.864 NaN 12.4012 e 10.4797 NaN 15240.1 NaN -10 NaN 87
95 88 Ukraine 0.749746 NaN 71.953 NaN 15.0738 e 11.34 m 7994.21 NaN 25 NaN 88
96 89 Dominican Republic 0.744641 NaN 73.892 NaN 14.1408 NaN 7.93535 NaN 15074.3 NaN -10 NaN 90
97 89 Saint Lucia 0.744931 NaN 76.057 NaN 13.8728 e 8.49149 NaN 11528.4 NaN 7 NaN 89
98 91 Tunisia 0.739201 NaN 76.505 NaN 15.1015 NaN 7.17226 e 10677 NaN 10 NaN 91
99 92 Mongolia 0.734684 NaN 69.689 NaN 14.2066 e 10.172 e 10783.7 NaN 7 NaN 94
100 93 Lebanon 0.730078 NaN 78.875 NaN 11.2859 NaN 8.7 n 11136.2 NaN 5 NaN 93
101 94 Botswana 0.727787 NaN 69.275 NaN 12.6972 e 9.33 m 15951.3 NaN -21 NaN 97
102 94 Saint Vincent and the Grenadines 0.727865 NaN 72.415 NaN 13.5717 e 8.61654 n 11746.4 NaN -2 NaN 95
103 96 Jamaica 0.725679 NaN 74.368 NaN 13.1407 e 9.79691 e 7931.52 NaN 18 NaN 96
104 96 Venezuela (Bolivarian Republic of) 0.725773 NaN 72.128 NaN 12.8161 e 10.3234 NaN 9069.7 p 14 NaN 92
105 98 Dominica 0.723788 NaN 78.119 f 12.9662 e 7.8 j 9245.16 NaN 10 NaN 98
106 98 Fiji 0.723699 NaN 67.341 NaN 14.4273 e 10.884 i 9110.44 NaN 11 NaN 102
107 98 Paraguay 0.724348 NaN 74.131 NaN 12.6899 e 8.45044 NaN 11720 NaN -5 NaN 99
108 98 Suriname 0.723693 NaN 71.57 NaN 12.8588 e 9.12537 NaN 11933 NaN -8 NaN 99
109 102 Jordan 0.723445 NaN 74.405 NaN 11.8767 e 10.452 i 8267.81 NaN 10 NaN 99
110 103 Belize 0.72017 NaN 74.496 NaN 13.1233 NaN 9.79584 l 7135.97 NaN 17 NaN 103
111 104 Maldives 0.718707 NaN 78.627 NaN 12.118 q 6.818 q 12549.3 NaN -17 NaN 105
112 105 Tonga 0.71743 NaN 70.801 NaN 14.3 e 11.21 i 5782.57 NaN 26 NaN 104
113 106 Philippines 0.71186 NaN 71.095 NaN 12.7225 e 9.39309 e 9539.7 NaN -1 NaN 106
114 107 Moldova (Republic of) 0.711452 NaN 71.808 NaN 11.6139 NaN 11.5628 NaN 6833.11 NaN 16 NaN 106
115 108 Turkmenistan 0.710131 NaN 68.073 NaN 10.8893 e 9.78 q 16407.5 NaN -38 NaN 108
116 108 Uzbekistan 0.710478 NaN 71.573 NaN 12.0121 NaN 11.525 NaN 6461.84 NaN 18 NaN 109
117 110 Libya 0.707568 NaN 72.724 NaN 12.7899 n 7.56 m 11684.7 r -16 NaN 111
118 111 Indonesia 0.706858 NaN 71.509 NaN 12.9156 NaN 7.97615 NaN 11255.8 NaN -14 NaN 111
119 111 Samoa 0.706771 NaN 73.187 NaN 12.52 e 10.5949 j 5884.84 NaN 18 NaN 110
120 113 South Africa 0.704937 NaN 63.857 NaN 13.6679 NaN 10.2406 NaN 11756.3 NaN -22 NaN 111
121 114 Bolivia (Plurinational State of) 0.702842 NaN 71.239 NaN 14.0131 s 9.01754 NaN 6849.2 NaN 8 NaN 114
122 115 Gabon 0.701625 NaN 66.187 NaN 12.9 n 8.32007 q 15794.1 NaN -40 NaN 114
123 116 Egypt 0.699727 NaN 71.825 NaN 13.1048 NaN 7.326 i 10743.8 NaN -16 NaN 116
124 NaN MEDIUM HUMAN DEVELOPMENT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
125 117 Marshall Islands 0.697589 NaN 73.861 f 12.3901 e 10.8857 e 4633.48 NaN 21 NaN 116
126 118 Viet Nam 0.692669 NaN 75.317 NaN 12.6923 l 8.2 i 6220.27 NaN 10 NaN 118
127 119 Palestine, State of 0.690017 NaN 73.895 NaN 12.843 NaN 9.10464 NaN 5313.83 NaN 15 NaN 119
128 120 Iraq 0.688784 NaN 70.454 NaN 11.1487 q 7.316 e 15365 NaN -44 NaN 120
129 121 Morocco 0.676439 NaN 76.453 NaN 13.0676 e 5.514 i 7479.59 NaN -3 NaN 121
130 122 Kyrgyzstan 0.674203 NaN 71.321 NaN 13.361 NaN 10.8804 l 3316.79 NaN 30 NaN 122
131 123 Guyana 0.670296 NaN 69.774 NaN 11.4726 e 8.46677 l 7615.42 NaN -7 NaN 123
132 124 El Salvador 0.666681 NaN 73.096 NaN 12.0356 NaN 6.94276 NaN 6973.46 NaN -3 NaN 124
133 125 Tajikistan 0.656039 NaN 70.879 NaN 11.4079 e 10.6717 q 3482.38 NaN 26 NaN 126
134 126 Cabo Verde 0.650693 NaN 72.782 NaN 11.8668 NaN 6.24457 NaN 6513.49 NaN -1 NaN 128
135 126 Guatemala 0.651035 NaN 74.063 NaN 10.6215 NaN 6.46789 NaN 7377.92 NaN -7 NaN 127
136 126 Nicaragua 0.651104 NaN 74.275 NaN 12.2069 s 6.796 i 4789.84 NaN 11 NaN 125
137 129 India 0.646901 NaN 69.416 NaN 12.3498 NaN 6.45213 e 6828.6 NaN -5 NaN 129
138 130 Namibia 0.645037 NaN 63.373 NaN 12.6334 q 6.94 i 9682.66 NaN -27 NaN 129
139 131 Timor-Leste 0.625887 NaN 69.26 NaN 12.3982 e 4.53788 q 7526.66 NaN -14 NaN 131
140 132 Honduras 0.622981 NaN 75.088 NaN 10.2079 NaN 6.59644 NaN 4258.35 NaN 7 NaN 133
141 132 Kiribati 0.623244 NaN 68.116 NaN 11.81 e 7.86553 j 3917.43 NaN 11 NaN 132
142 134 Bhutan 0.617299 NaN 71.46 NaN 12.128 e 3.13756 e 8609.12 NaN -23 NaN 134
143 135 Bangladesh 0.613679 NaN 72.32 NaN 11.2045 NaN 6.06183 NaN 4057.25 NaN 6 NaN 136
144 135 Micronesia (Federated States of) 0.614158 NaN 67.755 NaN 11.5495 j 7.71738 j 3700.1 NaN 10 NaN 135
145 137 Sao Tome and Principe 0.608556 NaN 70.17 NaN 12.6872 e 6.43676 e 3024.43 NaN 20 NaN 138
146 138 Congo 0.6085 NaN 64.29 NaN 11.6021 n 6.50227 m 5803.88 NaN -8 NaN 136
147 138 Eswatini (Kingdom of) 0.608082 NaN 59.401 NaN 11.3788 e 6.74781 l 9359.11 NaN -32 NaN 138
148 140 Lao People's Democratic Republic 0.604069 NaN 67.61 NaN 11.0572 NaN 5.202 i 6316.52 NaN -13 NaN 140
149 141 Vanuatu 0.596848 NaN 70.323 NaN 11.4169 e 6.84 l 2807.86 NaN 17 NaN 141
150 142 Ghana 0.59567 NaN 63.78 NaN 11.5192 NaN 7.184 i 4098.86 NaN -2 NaN 142
151 143 Zambia 0.591462 NaN 63.51 NaN 12.061 q 7.10418 q 3581.89 NaN 7 NaN 144
152 144 Equatorial Guinea 0.588422 NaN 58.402 NaN 9.2 n 5.55089 j 17795.5 NaN -80 NaN 143
153 145 Myanmar 0.584252 NaN 66.867 NaN 10.3154 NaN 4.9534 q 5763.94 NaN -13 NaN 146
154 146 Cambodia 0.581485 NaN 69.57 NaN 11.3368 e 4.844 i 3597.4 NaN 2 NaN 145
155 147 Kenya 0.578588 NaN 66.342 NaN 11.0642 e 6.564 i 3051.69 NaN 9 NaN 148
156 147 Nepal 0.579485 NaN 70.478 NaN 12.1952 NaN 4.856 i 2748.2 NaN 13 NaN 148
157 149 Angola 0.574488 NaN 60.782 NaN 11.7772 q 5.12546 q 5554.7 NaN -16 NaN 147
158 150 Cameroon 0.562725 NaN 58.921 NaN 12.7494 NaN 6.2917 l 3291.13 NaN 3 NaN 150
159 150 Zimbabwe 0.5631 NaN 61.195 NaN 10.4511 NaN 8.34225 e 2661.07 NaN 12 NaN 153
160 152 Pakistan 0.560446 NaN 67.114 NaN 8.46498 NaN 5.15894 NaN 5190.08 NaN -17 NaN 151
161 153 Solomon Islands 0.557341 NaN 72.835 NaN 10.2204 e 5.54073 q 2026.72 NaN 13 NaN 152
162 NaN LOW HUMAN DEVELOPMENT NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
163 154 Syrian Arab Republic 0.548903 NaN 71.779 NaN 8.8501 e 5.1 t 2725.19 r 7 NaN 154
164 155 Papua New Guinea 0.543076 NaN 64.263 NaN 9.99646 e 4.62 i 3685.8 NaN -9 NaN 155
165 156 Comoros 0.537826 NaN 64.118 NaN 11.2389 e 4.90853 q 2426.39 NaN 7 NaN 156
166 157 Rwanda 0.536015 NaN 68.7 NaN 11.1666 NaN 4.41715 e 1958.61 NaN 11 NaN 158
167 158 Nigeria 0.534136 NaN 54.332 NaN 9.74641 l 6.4619 q 5085.54 NaN -22 NaN 157
168 159 Tanzania (United Republic of) 0.528303 NaN 65.015 NaN 8.00989 NaN 6.008 i 2805.12 NaN 0 NaN 160
169 159 Uganda 0.528191 NaN 62.973 NaN 11.2426 e 6.09025 q 1752.21 NaN 11 NaN 160
170 161 Mauritania 0.527057 NaN 64.704 NaN 8.46557 NaN 4.614 i 3746.08 NaN -17 NaN 159
171 162 Madagascar 0.520697 NaN 66.681 NaN 10.4132 NaN 6.1 n 1403.92 NaN 19 NaN 162
172 163 Benin 0.519834 NaN 61.47 NaN 12.6145 NaN 3.774 m 2134.59 NaN 2 NaN 163
173 164 Lesotho 0.517955 NaN 53.705 NaN 10.7378 NaN 6.348 i 3243.84 NaN -9 NaN 164
174 165 Côte d'Ivoire 0.515747 NaN 57.422 NaN 9.62984 NaN 5.192 i 3589.41 NaN -16 NaN 165
175 166 Senegal 0.51377 NaN 67.665 NaN 8.97107 NaN 3.06893 e 3255.99 NaN -12 NaN 166
176 167 Togo 0.512733 NaN 60.76 NaN 12.5683 NaN 4.94993 q 1592.54 NaN 10 NaN 166
177 168 Sudan 0.507499 NaN 65.095 NaN 7.73987 e 3.72 i 3961.62 NaN -26 NaN 168
178 169 Haiti 0.502731 NaN 63.66 NaN 9.5 n 5.44169 q 1664.89 NaN 6 NaN 169
179 170 Afghanistan 0.49596 NaN 64.486 NaN 10.1389 NaN 3.93 i 1745.67 NaN 1 NaN 170
180 171 Djibouti 0.495433 NaN 66.582 NaN 6.4845 e 4 j 3600.71 u -24 NaN 171
181 172 Malawi 0.4854 NaN 63.798 NaN 10.9528 q 4.63 i 1159.12 NaN 11 NaN 172
182 173 Ethiopia 0.46985 NaN 66.24 NaN 8.71019 e 2.79544 q 1781.76 NaN -4 NaN 173
183 174 Gambia 0.465725 NaN 61.735 NaN 9.4821 e 3.666 q 1489.57 NaN 4 NaN 178
184 174 Guinea 0.465515 NaN 61.185 NaN 9.00675 e 2.70739 q 2211 NaN -10 NaN 175
185 176 Liberia 0.46472 NaN 63.73 NaN 9.57967 e 4.668 i 1040.09 NaN 9 NaN 173
186 177 Yemen 0.462717 NaN 66.096 NaN 8.66439 e 3.2 m 1433.3 r 3 NaN 175
187 178 Guinea-Bissau 0.461406 NaN 58.003 NaN 10.5 n 3.3035 l 1593.18 NaN -2 NaN 177
188 179 Congo (Democratic Republic of the) 0.45873 NaN 60.368 NaN 9.70259 e 6.75948 NaN 800.021 NaN 8 NaN 179
189 180 Mozambique 0.446045 NaN 60.163 NaN 9.74711 NaN 3.5409 e 1153.7 NaN 4 NaN 180
190 181 Sierra Leone 0.438465 NaN 54.309 NaN 10.1759 e 3.6 i 1381.3 NaN 1 NaN 181
191 182 Burkina Faso 0.433547 NaN 61.174 NaN 8.90646 NaN 1.58557 q 1705.49 NaN -8 NaN 183
192 182 Eritrea 0.43358 NaN 65.941 NaN 5.00528 NaN 3.9 n 1707.71 u -9 NaN 182
193 184 Mali 0.427206 NaN 58.893 NaN 7.60336 NaN 2.35295 l 1965.39 NaN -17 NaN 184
194 185 Burundi 0.422882 NaN 61.247 NaN 11.3046 NaN 3.12437 q 659.732 NaN 4 NaN 185
195 186 South Sudan 0.41277 NaN 57.604 NaN 5.00038 e 4.84913 NaN 1455.23 u -7 NaN 186
196 187 Chad 0.401176 NaN 53.977 NaN 7.46536 e 2.4095 q 1715.57 NaN -15 NaN 187
197 188 Central African Republic 0.380662 NaN 52.805 NaN 7.56836 e 4.282 i 776.676 NaN 0 NaN 188
198 189 Niger 0.376591 NaN 62.024 NaN 6.47145 NaN 2.02905 e 912.042 NaN -3 NaN 189
199 NaN OTHER COUNTRIES OR TERRITORIES NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
200 .. Korea (Democratic People's Rep. of) .. NaN 72.095 NaN 10.8406 e .. NaN .. NaN .. NaN ..
201 .. Monaco .. NaN .. NaN .. NaN .. NaN .. NaN .. NaN ..
202 .. Nauru .. NaN .. NaN 11.2594 e .. NaN 17312.6 NaN .. NaN ..
203 .. San Marino .. NaN .. NaN 15.1112 NaN .. NaN .. NaN .. NaN ..
204 .. Somalia .. NaN 57.068 NaN .. NaN .. NaN .. NaN .. NaN ..
205 .. Tuvalu .. NaN .. NaN 12.3089 NaN .. NaN 5408.95 NaN .. NaN ..
206 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
207 NaN Human development groups NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
208 NaN Very high human development 0.891786 NaN 79.505 NaN 16.3618 NaN 12.0436 NaN 40111.6 NaN — NaN —
209 NaN High human development 0.749566 NaN 75.145 NaN 13.8382 NaN 8.30443 NaN 14403.3 NaN — NaN —
210 NaN Medium human development 0.633837 NaN 69.3348 NaN 11.6952 NaN 6.37351 NaN 6239.97 NaN — NaN —
211 NaN Low human development 0.507037 NaN 61.3145 NaN 9.30426 NaN 4.77624 NaN 2580.74 NaN — NaN —
212 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
213 NaN Developing countries 0.685828 NaN 71.0868 NaN 12.181 NaN 7.38132 NaN 10476.2 NaN — NaN —
214 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
215 NaN Regions NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
216 NaN Arab States 0.703107 NaN 71.8994 NaN 12.005 NaN 7.1439 NaN 15721.3 NaN — NaN —
217 NaN East Asia and the Pacific 0.740919 NaN 75.2575 NaN 13.3805 NaN 7.91966 NaN 14611.4 NaN — NaN —
218 NaN Europe and Central Asia 0.778656 NaN 74.2085 NaN 14.5663 NaN 10.1536 NaN 15498.3 NaN — NaN —
219 NaN Latin America and the Caribbean 0.758574 NaN 75.4161 NaN 14.4565 NaN 8.57275 NaN 13857.3 NaN — NaN —
220 NaN South Asia 0.642274 NaN 69.7155 NaN 11.795 NaN 6.47856 NaN 6794.16 NaN — NaN —
221 NaN Sub-Saharan Africa 0.540527 NaN 61.179 NaN 9.96504 NaN 5.68506 NaN 3443.19 NaN — NaN —
222 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
223 NaN Least developed countries 0.528475 NaN 65.0398 NaN 9.77131 NaN 4.79609 NaN 2629.77 NaN — NaN —
224 NaN Small island developing states 0.722653 NaN 71.8426 NaN 12.2007 NaN 8.64299 NaN 15552.7 NaN — NaN —
225 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
226 NaN Organisation for Economic Co-operation and Dev... 0.895397 NaN 80.3687 NaN 16.2818 NaN 11.988 NaN 40614.6 NaN — NaN —
227 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
228 NaN World 0.731365 NaN 72.581 NaN 12.6665 NaN 8.43365 NaN 15745.4 NaN — NaN —
229 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
230 NaN NOTES NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
231 NaN a. Data refer to 2018 or the most recent year ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
232 NaN b. In calculating the HDI value, expected year... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
233 NaN c. Based on data from OECD (2018). NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
234 NaN d. In calculating the HDI value, GNI per capit... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
235 NaN e. Updated by HDRO based on data from UNESCO I... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
236 NaN f. Value from UNDESA (2011). NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
237 NaN g. Imputed mean years of schooling for Austria. NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
238 NaN h. Estimated using the PPP rate and projected ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
239 NaN i. Based on Barro and Lee (2018). NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
240 NaN j. Based on data from the national statistical... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
241 NaN k. Estimated using the purchasing power parity... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
242 NaN l. Updated by HDRO based on data from United N... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
243 NaN m. Updated by HDRO using Barro and Lee (2018) ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
244 NaN n. Based on cross-country regression. NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
245 NaN o. Based on a cross-country regression and the... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
246 NaN p. HDRO estimate based on data from World Bank... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
247 NaN q. Updated by HDRO based on data from ICF Macr... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
248 NaN r. HDRO estimate based on data from World Bank... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
249 NaN s. Updated by HDRO based on data from SEDLAS a... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
250 NaN t. Updated by HDRO based on Syrian Center for ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
251 NaN u. HDRO estimate based on data from World bank... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
252 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
253 NaN DEFINITIONS NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
254 NaN Human Development Index (HDI): A composite ind... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
255 NaN Life expectancy at birth: Number of years a ne... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
256 NaN Expected years of schooling: Number of years o... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
257 NaN Mean years of schooling: Average number of yea... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
258 NaN Gross national income (GNI) per capita: Aggreg... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
259 NaN GNI per capita rank minus HDI rank: Difference... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
260 NaN HDI rank for 2017: Ranking by HDI value for 20... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
261 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
262 NaN MAIN DATA SOURCES NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
263 NaN Columns 1 and 7: HDRO calculations based on da... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
264 NaN Column 2: UNDESA (2019b). NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
265 NaN Column 3: UNESCO Institute for Statistics (201... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
266 NaN Column 4: UNESCO Institute for Statistics (201... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
267 NaN Column 5: World Bank (2019a), IMF (2019) and U... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
268 NaN Column 6: Calculated based on data in columns ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Why doesn't a single database use standardised country codes?????? This is the same thing as before, only this time I was too lazy to type out the entire mapping so I just used pycountry's fuzzy_search instead. The country codes and names have been standardised, so it should be the same as that of the dfs before it.

In [23]:
hdidf = df.iloc[7:199, [1, 2]].reset_index(drop=True)
hdidf.columns = ["Country", "HDI"]
hdidf.Country.replace(
    {"Congo (Democratic Republic of the)": "Congo, the Democratic Republic of the"},
    inplace=True,
)
hdidf.insert(0, "CountryCode", hdidf.Country.apply(get_country_code))
not_matched = hdidf[hdidf.CountryCode.isna()]
hdidf = hdidf[~hdidf.CountryCode.isna()]
display(hdidf)
not_matched = not_matched[~not_matched.Country.str.contains("DEVELOPMENT")]
not_matched.Country = not_matched.Country.str.split(",|\(", expand=True)[0]
not_matched.CountryCode = not_matched.Country.apply(
    lambda x: pycountry.countries.search_fuzzy(x)[0].alpha_3
)
hdidf = pd.concat([hdidf, not_matched])
hdidf.Country = hdidf.CountryCode.apply(get_country_name)
hdidf.sort_values(by="CountryCode", inplace=True)
hdidf.reset_index(inplace=True, drop=True)
display(not_matched)
display(hdidf)
CountryCode Country HDI
0 NOR Norway 0.953688
1 CHE Switzerland 0.945936
2 IRL Ireland 0.942473
3 DEU Germany 0.938785
5 AUS Australia 0.938379
... ... ... ...
187 BDI Burundi 0.422882
188 SSD South Sudan 0.41277
189 TCD Chad 0.401176
190 CAF Central African Republic 0.380662
191 NER Niger 0.376591

180 rows × 3 columns

CountryCode Country HDI
4 HKG Hong Kong 0.938809
22 PRK Korea 0.905832
65 IRN Iran 0.797483
97 VEN Venezuela 0.725773
107 MDA Moldova 0.711452
114 BOL Bolivia 0.702842
137 FSM Micronesia 0.614158
140 SWZ Eswatini 0.608082
161 TZA Tanzania 0.528303
CountryCode Country HDI
0 AFG Afghanistan 0.49596
1 AGO Angola 0.574488
2 ALB Albania 0.791406
3 AND Andorra 0.856781
4 ARE United Arab Emirates 0.866438
... ... ... ...
184 WSM Samoa 0.706771
185 YEM Yemen 0.462717
186 ZAF South Africa 0.704937
187 ZMB Zambia 0.591462
188 ZWE Zimbabwe 0.5631

189 rows × 3 columns

Data cleaning is now complete, and the datasets are as follows:

In [24]:
display(HTML("<h5>Mortality Causes DataFrame: </h5>"))
display(mdf)
display(HTML("<h5>BMI DataFrame: </h5>"))
display(bmidf)
display(HTML("<h5>CMR DataFrame: </h5>"))
display(cmrdf)
display(HTML("<h5>HDI DataFrame: </h5>"))
display(hdidf)
Mortality Causes DataFrame:
CountryCode Country Cause CauseName 1979 1980 1981 1982 1983 1984 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 ALB Albania 1 Certain infectious and parasitic diseases NaN NaN NaN NaN NaN NaN ... 2.0 2.0 1.1 1.2 NaN NaN NaN NaN NaN NaN
1 ALB Albania 2 Neoplasms NaN NaN NaN NaN NaN NaN ... 74.2 80.5 74.7 56.4 NaN NaN NaN NaN NaN NaN
2 ALB Albania 5 Mental and behavioural disorders NaN NaN NaN NaN NaN NaN ... 1.8 1.6 1.5 1.7 NaN NaN NaN NaN NaN NaN
3 ALB Albania 6 Diseases of the nervous system NaN NaN NaN NaN NaN NaN ... 12.7 4.8 4.8 7.3 NaN NaN NaN NaN NaN NaN
4 ALB Albania 7 Diseases of the eye and adnexa NaN NaN NaN NaN NaN NaN ... 0.1 0.1 0.1 0.1 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1776 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue NaN 3.0 NaN NaN NaN NaN ... 2.5 1.4 0.7 4.1 1.8 0.6 NaN NaN 1.0 NaN
1777 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... NaN 3.2 NaN NaN NaN NaN ... 3.6 3.0 1.7 2.9 2.4 4.3 NaN NaN 1.6 NaN
1778 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system NaN 13.9 NaN NaN NaN NaN ... 8.6 6.8 7.5 13.8 7.1 7.7 NaN NaN 8.3 NaN
1779 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium NaN 14.7 NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 NaN
1780 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... NaN 8.7 NaN NaN NaN NaN ... 7.1 4.0 1.1 4.1 3.5 2.4 NaN NaN 0.8 NaN

1781 rows × 42 columns

BMI DataFrame:
CountryCode Country 1979 1980 1981 1982 1983 1984 1985 1986 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 AFG Afghanistan 19.4 19.5 19.6 19.8 19.9 20.0 20.1 20.2 ... 22.4 22.5 22.6 22.7 22.8 22.9 23.0 23.2 23.3 23.4
1 ALB Albania 24.0 24.1 24.1 24.2 24.2 24.3 24.3 24.4 ... 25.8 25.9 26.0 26.1 26.2 26.3 26.4 26.5 26.6 26.7
2 DZA Algeria 22.2 22.3 22.4 22.5 22.6 22.7 22.8 22.9 ... 24.8 24.9 25.0 25.1 25.1 25.2 25.3 25.4 25.5 25.5
3 AND Andorra 25.7 25.8 25.9 25.9 26.0 26.0 26.1 26.2 ... 26.8 26.8 26.8 26.8 26.8 26.8 26.8 26.7 26.7 26.7
4 AGO Angola 19.3 19.4 19.5 19.6 19.8 19.9 20.0 20.1 ... 22.5 22.6 22.7 22.8 22.9 23.0 23.1 23.2 23.2 23.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
181 VEN Venezuela, Bolivarian Republic of 24.4 24.5 24.6 24.7 24.8 24.9 25.0 25.1 ... 26.5 26.5 26.6 26.6 26.6 26.6 26.6 26.6 26.7 26.7
182 VNM Viet Nam 18.4 18.5 18.6 18.6 18.7 18.8 18.8 18.9 ... 20.7 20.8 20.9 21.0 21.2 21.3 21.5 21.6 21.7 21.9
183 YEM Yemen 20.1 20.2 20.3 20.4 20.5 20.6 20.7 20.8 ... 23.0 23.1 23.2 23.3 23.4 23.4 23.5 23.6 23.7 23.8
184 ZMB Zambia 20.0 20.1 20.2 20.4 20.5 20.5 20.6 20.7 ... 22.0 22.0 22.1 22.2 22.2 22.3 22.4 22.4 22.5 22.6
185 ZWE Zimbabwe 22.3 22.3 22.4 22.4 22.5 22.6 22.6 22.7 ... 23.5 23.6 23.6 23.6 23.7 23.7 23.7 23.8 23.8 23.8

186 rows × 40 columns

CMR DataFrame:
CountryCode Country 1979 1980 1981 1982 1983 1984 1985 1986 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 AFG Afghanistan 165.257379 161.028132 156.763436 152.481639 148.354144 144.321605 140.257604 136.247094 ... 71.660270 69.035406 66.525823 64.100567 61.726441 59.454642 57.225151 55.160932 53.169807 51.319141
1 ALB Albania 71.181438 66.155752 61.475555 57.115470 53.214275 49.681820 46.508865 43.680472 ... 15.292477 14.077470 12.914518 11.828097 10.825056 9.963018 9.284420 8.805975 8.509460 8.378718
2 DZA Algeria 108.045384 101.794747 94.546163 86.274384 77.023313 67.696731 59.545794 53.206759 ... 26.437225 25.349898 24.381526 23.562945 22.933794 22.471525 22.137349 21.873076 21.612694 21.280396
3 AND Andorra NaN NaN NaN NaN NaN NaN 10.106871 9.893521 ... 4.916380 4.680900 4.459207 4.256321 4.043769 3.850386 3.665888 3.495399 3.340385 3.200425
4 AGO Angola NaN 140.238393 138.636313 137.023220 135.410888 134.155570 133.220981 132.510430 ... 90.490680 85.345245 80.484163 75.868650 71.420473 67.316445 63.715246 60.540468 57.816653 55.562227
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
190 VEN Venezuela, Bolivarian Republic of 36.571065 35.186548 33.864139 32.649864 31.545204 30.547752 29.618986 28.669964 ... 14.982381 14.745607 14.653889 14.666083 14.743388 14.887073 15.119070 15.431784 16.436529 21.044404
191 VNM Viet Nam 47.531900 46.631146 45.737060 44.938104 44.172720 43.422350 42.616956 41.720032 ... 19.067110 18.766418 18.495580 18.261223 18.038662 17.818400 17.617092 17.417050 17.201177 16.937983
192 YEM Yemen 147.133742 139.228105 131.385962 123.940729 117.077235 111.019952 105.715362 101.186844 ... 49.343347 46.984256 44.940469 43.634253 43.077472 42.964150 42.962961 42.986121 43.358838 43.310794
193 ZMB Zambia 95.096937 95.596715 96.032758 96.717562 97.831035 99.456249 101.373502 103.377671 ... 58.374573 55.919851 54.130195 52.323193 51.028241 49.923447 48.746112 47.472249 46.721240 45.789160
194 ZWE Zimbabwe 70.143121 68.574404 66.204777 63.317586 60.084513 56.855553 53.973998 51.651692 ... 56.202805 56.418056 55.866004 53.776518 51.631998 47.853394 45.864949 44.245706 42.868009 41.392057

195 rows × 40 columns

HDI DataFrame:
CountryCode Country HDI
0 AFG Afghanistan 0.49596
1 AGO Angola 0.574488
2 ALB Albania 0.791406
3 AND Andorra 0.856781
4 ARE United Arab Emirates 0.866438
... ... ... ...
184 WSM Samoa 0.706771
185 YEM Yemen 0.462717
186 ZAF South Africa 0.704937
187 ZMB Zambia 0.591462
188 ZWE Zimbabwe 0.5631

189 rows × 3 columns

Let us create some long versions of the DataFrames as well.

In [25]:
longmdf = pd.melt(
    mdf,
    id_vars=["CountryCode", "Country", "Cause", "CauseName"],
    var_name="Year",
    value_name="DeathsPer100k",
)
longbmidf = pd.melt(
    bmidf, id_vars=["CountryCode", "Country"], var_name="Year", value_name="BMI"
)
longcmrdf = pd.melt(
    cmrdf, id_vars=["CountryCode", "Country"], var_name="Year", value_name="CMR"
)
display(longmdf)
display(longbmidf)
display(longcmrdf)
CountryCode Country Cause CauseName Year DeathsPer100k
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN
1 ALB Albania 2 Neoplasms 1979 NaN
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN
3 ALB Albania 6 Diseases of the nervous system 1979 NaN
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN
... ... ... ... ... ... ...
67673 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue 2016 NaN
67674 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... 2016 NaN
67675 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system 2016 NaN
67676 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium 2016 NaN
67677 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... 2016 NaN

67678 rows × 6 columns

CountryCode Country Year BMI
0 AFG Afghanistan 1979 19.4
1 ALB Albania 1979 24.0
2 DZA Algeria 1979 22.2
3 AND Andorra 1979 25.7
4 AGO Angola 1979 19.3
... ... ... ... ...
7063 VEN Venezuela, Bolivarian Republic of 2016 26.7
7064 VNM Viet Nam 2016 21.9
7065 YEM Yemen 2016 23.8
7066 ZMB Zambia 2016 22.6
7067 ZWE Zimbabwe 2016 23.8

7068 rows × 4 columns

CountryCode Country Year CMR
0 AFG Afghanistan 1979 165.257379
1 ALB Albania 1979 71.181438
2 DZA Algeria 1979 108.045384
3 AND Andorra 1979 NaN
4 AGO Angola 1979 NaN
... ... ... ... ...
7405 VEN Venezuela, Bolivarian Republic of 2016 21.044404
7406 VNM Viet Nam 2016 16.937983
7407 YEM Yemen 2016 43.310794
7408 ZMB Zambia 2016 45.789160
7409 ZWE Zimbabwe 2016 41.392057

7410 rows × 4 columns

In general, the leading cause of mortality has shifted from infectious and parasitic diseases to diseases of the circulatory system to neoplasms as time progresses.

Circulatory system illness remain by far the most common cause of death with neoplasms trailing behind. The other causes are far below these two.

Countries that are outliers when it comes to causes of death generally remain outliers for at least a decade. This shows that in order to be an outlier, there must be something habitual that these countries cannot fix quickly.

Let us find the proportion of deaths caused by each Cause globally. Note that the total number of deaths per year changes, so we need to divide the deaths per 100,000 per Cause by total deaths per 100,000.

In [143]:
deathsPerYearByCause = mdf.groupby("CauseName").mean()
display(deathsPerYearByCause)
display(deathsPerYearByCause.sum())
dpybcNormal = (deathsPerYearByCause / deathsPerYearByCause.sum()).drop("Cause", axis=1)
dpybcNormal.T.plot(kind="line", figsize=(20, 10), colormap='gist_rainbow').legend(bbox_to_anchor=(1, 1))
dpybcNormal.T.plot(kind="line", figsize=(20, 10), colormap='gist_rainbow', ylim=(0, 0.1)).legend(
    bbox_to_anchor=(1, 1)
)
Cause 1979 1980 1981 1982 1983 1984 1985 1986 1987 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
CauseName
Certain infectious and parasitic diseases 1 32.208333 27.803922 26.175385 22.732258 21.658824 27.177358 22.111111 21.350000 22.373418 ... 21.112500 21.638947 20.055556 19.483673 19.370103 18.498913 17.543478 16.751136 16.226471 11.383333
Congenital malformations, deformations and chromosomal abnormalities 16 6.966667 7.135294 7.244615 7.243548 6.976471 6.639623 6.954167 6.683333 6.541772 ... 4.668750 4.562105 4.409091 4.544898 4.713402 4.552174 4.490217 4.523864 4.458824 4.416667
Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 3.153261 3.339560 2.896842 3.019149 2.903158 2.848352 2.657143 2.529545 2.680882 2.050000
Diseases of the circulatory system 9 316.672222 328.562745 343.403077 342.240323 308.686275 300.616981 345.008333 331.254167 325.879747 ... 240.070833 235.381053 219.078788 210.455102 203.438144 195.495652 184.958696 188.875000 188.208824 210.205556
Diseases of the digestive system 11 34.675000 38.401961 35.346154 34.383871 32.658824 34.226415 35.001389 32.743056 31.608861 ... 29.258333 29.497895 26.720202 26.885714 26.016495 26.045652 25.777174 25.890909 26.991176 28.500000
Diseases of the ear and mastoid process 8 0.113889 0.126531 0.111765 0.089583 0.097959 0.088235 0.136207 0.077586 0.120000 ... 0.032632 0.051064 0.030612 0.028866 0.030928 0.013043 0.043478 0.015909 0.060294 0.016667
Diseases of the eye and adnexa 7 0.008333 0.018367 0.025490 0.027083 0.057143 0.054902 0.043103 0.018966 0.007692 ... 0.007368 0.007447 0.023469 0.008247 0.015464 0.019565 0.016304 0.021591 0.019118 0.005556
Diseases of the genitourinary system 14 16.616667 16.170588 14.744615 14.685484 15.192157 14.401887 14.976389 14.119444 13.812658 ... 13.151042 13.034737 12.278788 12.623469 12.222680 12.272826 12.053261 11.906818 12.519118 10.777778
Diseases of the musculoskeletal system and connective tissue 13 2.300000 2.301961 2.067692 2.088710 2.552941 2.475472 2.129167 2.109722 1.810127 ... 2.097917 2.147368 2.037374 2.130612 2.009278 2.131522 2.233696 2.154545 2.366176 1.750000
Diseases of the nervous system 6 9.805556 11.232653 10.815686 10.931250 11.522449 11.086275 11.227586 10.912069 10.049231 ... 12.508421 12.694681 11.948980 12.490722 12.531959 13.236957 13.648913 13.946591 15.400000 13.916667
Diseases of the respiratory system 10 64.241667 69.078431 72.336923 68.637097 62.301961 62.445283 69.018056 63.590278 59.802532 ... 45.280208 45.087368 42.033333 41.297959 40.222680 38.736957 38.575000 38.542045 42.917647 34.994444
Diseases of the skin and subcutaneous tissue 12 1.588889 1.605882 1.296923 1.341935 1.870588 1.647170 1.420833 1.194444 1.505063 ... 1.687500 1.549474 1.571717 1.585714 1.452577 1.598913 1.408696 1.476136 1.658824 1.761111
Endocrine, nutritional and metabolic diseases 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 33.547826 33.397802 32.489474 31.621277 33.971579 32.250549 29.602198 28.917045 28.044118 28.611111
Mental and behavioural disorders 5 4.986111 4.821569 5.212308 4.556452 4.733333 4.796226 4.512500 4.969444 4.765823 ... 6.088542 6.221053 6.118182 6.392857 6.583505 7.000000 7.183696 7.229545 8.239706 7.994444
Neoplasms 2 125.641667 126.821569 124.909231 126.882258 129.123529 126.988679 130.644444 129.647222 128.015190 ... 115.535417 116.133684 112.129293 110.590816 109.625773 111.945652 109.398913 109.404545 112.839706 129.877778
Pregnancy, childbirth and the puerperium 15 2.911111 2.725490 2.200000 2.079032 1.776471 2.075472 1.972222 1.670833 1.634177 ... 0.819792 0.867368 0.803030 0.901020 0.798969 0.755435 0.711957 0.611364 0.545588 0.561111

16 rows × 39 columns

Cause    136.000000
1979     618.736111
1980     636.806963
1981     645.889864
1982     637.918884
            ...    
2012     467.402162
2013     450.302819
2014     452.796591
2015     463.176471
2016     486.822222
Length: 39, dtype: float64
Out[143]:
<matplotlib.legend.Legend at 0x23544f437b8>

The cause with the highest fraction of deaths by far is that of circulatory diseases. This has been decreasing since the 1990s but have recently risen.

The second is neoplasms, which is essentially cancer. This has been gradually increasing since the 1980s, but have recently spiked.

At around 1984, cardiovascular diseases suddenly took a dip while then number 2, 4, 5 (neoplasms, digestive diseases, infectious diseases) spiked.

The third is respiratory diseases. Respiratory diseases in general have been decreasing as the world becomes more developed.

Endocrine, nutritional and metabolic diseases have been on the rise quite rapidly since 1994.

Let us make an animated map of cardiovascular disease deaths across time. The data has to be interpolated based on last recorded number of deaths. This would prevent the countries' color from snapping in between an actual color and gray.

In [27]:
import plotly.express as px
from plotly.offline import init_notebook_mode

init_notebook_mode(connected=True)
mdf.loc[:, "1979":"2016"].apply(pd.to_numeric, errors="coerce")
mdf["1979"] = mdf["1979"].astype(float)
display(mdf)

# with pd.option_context('display.max_rows', None):
mdfInterp = pd.concat(
    [mdf.iloc[:, :4], mdf.loc[:, "1979":"2016"].interpolate(method="pad", axis=1)],
    axis=1,
)
display(mdfInterp)
CountryCode Country Cause CauseName 1979 1980 1981 1982 1983 1984 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 ALB Albania 1 Certain infectious and parasitic diseases NaN NaN NaN NaN NaN NaN ... 2.0 2.0 1.1 1.2 NaN NaN NaN NaN NaN NaN
1 ALB Albania 2 Neoplasms NaN NaN NaN NaN NaN NaN ... 74.2 80.5 74.7 56.4 NaN NaN NaN NaN NaN NaN
2 ALB Albania 5 Mental and behavioural disorders NaN NaN NaN NaN NaN NaN ... 1.8 1.6 1.5 1.7 NaN NaN NaN NaN NaN NaN
3 ALB Albania 6 Diseases of the nervous system NaN NaN NaN NaN NaN NaN ... 12.7 4.8 4.8 7.3 NaN NaN NaN NaN NaN NaN
4 ALB Albania 7 Diseases of the eye and adnexa NaN NaN NaN NaN NaN NaN ... 0.1 0.1 0.1 0.1 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1776 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue NaN 3.0 NaN NaN NaN NaN ... 2.5 1.4 0.7 4.1 1.8 0.6 NaN NaN 1.0 NaN
1777 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... NaN 3.2 NaN NaN NaN NaN ... 3.6 3.0 1.7 2.9 2.4 4.3 NaN NaN 1.6 NaN
1778 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system NaN 13.9 NaN NaN NaN NaN ... 8.6 6.8 7.5 13.8 7.1 7.7 NaN NaN 8.3 NaN
1779 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium NaN 14.7 NaN NaN NaN NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN 0.0 NaN
1780 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... NaN 8.7 NaN NaN NaN NaN ... 7.1 4.0 1.1 4.1 3.5 2.4 NaN NaN 0.8 NaN

1781 rows × 42 columns

CountryCode Country Cause CauseName 1979 1980 1981 1982 1983 1984 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0 ALB Albania 1 Certain infectious and parasitic diseases NaN NaN NaN NaN NaN NaN ... 2.0 2.0 1.1 1.2 1.2 1.2 1.2 1.2 1.2 1.2
1 ALB Albania 2 Neoplasms NaN NaN NaN NaN NaN NaN ... 74.2 80.5 74.7 56.4 56.4 56.4 56.4 56.4 56.4 56.4
2 ALB Albania 5 Mental and behavioural disorders NaN NaN NaN NaN NaN NaN ... 1.8 1.6 1.5 1.7 1.7 1.7 1.7 1.7 1.7 1.7
3 ALB Albania 6 Diseases of the nervous system NaN NaN NaN NaN NaN NaN ... 12.7 4.8 4.8 7.3 7.3 7.3 7.3 7.3 7.3 7.3
4 ALB Albania 7 Diseases of the eye and adnexa NaN NaN NaN NaN NaN NaN ... 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1776 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue NaN 3.0 3.0 3.0 3.0 3.0 ... 2.5 1.4 0.7 4.1 1.8 0.6 0.6 0.6 1.0 1.0
1777 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... NaN 3.2 3.2 3.2 3.2 3.2 ... 3.6 3.0 1.7 2.9 2.4 4.3 4.3 4.3 1.6 1.6
1778 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system NaN 13.9 13.9 13.9 13.9 13.9 ... 8.6 6.8 7.5 13.8 7.1 7.7 7.7 7.7 8.3 8.3
1779 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium NaN 14.7 14.7 14.7 14.7 14.7 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1780 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... NaN 8.7 8.7 8.7 8.7 8.7 ... 7.1 4.0 1.1 4.1 3.5 2.4 2.4 2.4 0.8 0.8

1781 rows × 42 columns

We will now convert the DataFrame to a "long" format from the current "wide" format by using the method melt. This prepares the data for the choropleth visualization.

In [28]:
forMap = pd.melt(
    mdfInterp, id_vars=["CountryCode", "Country", "Cause", "CauseName"], var_name="Year"
)
forMap.reset_index(inplace=True, drop=True)
forMap.rename(columns={"value": "DeathsPer100k"}, inplace=True)
forMap.DeathsPer100k = forMap.DeathsPer100k.astype(float)

display(forMap)
CountryCode Country Cause CauseName Year DeathsPer100k
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN
1 ALB Albania 2 Neoplasms 1979 NaN
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN
3 ALB Albania 6 Diseases of the nervous system 1979 NaN
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN
... ... ... ... ... ... ...
67673 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue 2016 1.0
67674 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... 2016 1.6
67675 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system 2016 8.3
67676 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium 2016 0.0
67677 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... 2016 0.8

67678 rows × 6 columns

We will now divide DeathsPer100k by the total number of deaths in that country, in that year. This is because different countries have different total rates of death, and this question aims to analyse the proportion of deaths due to a particular cause globally across the years.

In [29]:
totalDeathsSeries = forMap.groupby(["CountryCode", "Year"]).DeathsPer100k.sum()


def calculate_death_prop(row):
    if row.DeathsPer100k != np.nan:
        totalDeaths = totalDeathsSeries.loc[(row.CountryCode, row.Year)]
        row.DeathsPer100k = row.DeathsPer100k / totalDeaths
    return row


display(forMap)
forMapCalibed = forMap.apply(
    calculate_death_prop, axis=1
)  # re-calibrated DataFrame for map creation
forMapCalibed.rename(columns={"DeathsPer100k": "DeathsProp"}, inplace=True)
display(forMapCalibed)
CountryCode Country Cause CauseName Year DeathsPer100k
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN
1 ALB Albania 2 Neoplasms 1979 NaN
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN
3 ALB Albania 6 Diseases of the nervous system 1979 NaN
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN
... ... ... ... ... ... ...
67673 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue 2016 1.0
67674 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... 2016 1.6
67675 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system 2016 8.3
67676 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium 2016 0.0
67677 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... 2016 0.8

67678 rows × 6 columns

CountryCode Country Cause CauseName Year DeathsProp
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN
1 ALB Albania 2 Neoplasms 1979 NaN
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN
3 ALB Albania 6 Diseases of the nervous system 1979 NaN
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN
... ... ... ... ... ... ...
67673 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue 2016 0.003338
67674 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... 2016 0.005340
67675 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system 2016 0.027704
67676 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium 2016 0.000000
67677 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... 2016 0.002670

67678 rows × 6 columns

We will now normalize the data using the min-max method.

In [30]:
forMapCalibed.DeathsProp /= (
    forMapCalibed.DeathsProp.max() - forMapCalibed.DeathsProp.min()
)
forMapCalibed
Out[30]:
CountryCode Country Cause CauseName Year DeathsProp
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN
1 ALB Albania 2 Neoplasms 1979 NaN
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN
3 ALB Albania 6 Diseases of the nervous system 1979 NaN
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN
... ... ... ... ... ... ...
67673 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue 2016 0.004068
67674 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... 2016 0.006509
67675 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system 2016 0.033764
67676 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium 2016 0.000000
67677 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... 2016 0.003254

67678 rows × 6 columns

As one can see, the DeathsPer100k have been successfully converted to DeathsProp and the normalization has also done its job. For instance, the deaths figure for CountryCode=VIR and Cause=14 has been converted from 8.3 originally to 0.027 after calibration to 0.033 after normalization. Now we can make our animated choropleth maps.

Note: Please see individual .html files for the results.

In [31]:
for cause in range(1, 17):
    forMapWithCause = forMapCalibed[forMapCalibed.Cause == cause]
    fig = px.choropleth(
        forMapWithCause,  # Input Dataframe
        locations="CountryCode",  # identify country code column
        color="DeathsProp",  # identify representing column
        hover_name="Country",  # identify hover name
        animation_frame="Year",  # identify date column
        projection="natural earth",  # select projection
        color_continuous_scale="Viridis",  # select preferred color scale
        range_color=[0, forMapWithCause.DeathsProp.max()],  # select range of dataset
        title=str(cause) + ". " + disease_codes.loc[cause, "name"],
    )
    fig.write_html("q1choropleth" + format(cause, "02d") + ".html", full_html=False)

I will go over some graphs that I find to be interesting.

Graph 9, diseases of the circulatory system: You can see a clear trend of the American countries gradually having circulatory system diseases become a smaller part of their causes of death. Most of Europe is exhibiting the same trend as well, with Eastern Bloc nations lagging behind the First World by about 30 years. Kazakhstan is the exception here, as it has reduced its circulatory disease deaths proportion significant in recent years. Egypt is roughly following the same trend as the Eastern Bloc.

Surprisingly, Mexico and Guatemala has since 1979 shown a very low proportion of deaths due to this cause. This breaks the trend of more developed countries having a lower proportion.

Graph 2, neoplasms: Compared to graph 9, the trend here is much less clear. In recent years, many developed countries have started to have a high proportion of this cause of death. In contrast to cardiovascular diseases, the countries' colors also tend to fluctuate a bit more between light and dark, indicating that the trend is much less obvious.

Graph 5, mental and behavioural disorders: First World countries have been experiencing a higher number of deaths due to this cause in the past 20 years, starting with Finland. I think this may be due to mental health becoming better documented in developed countries. Moreover, in many developing countries, parents would discreetly kill their child if they perceive something to be wrong with their minds early on, which makes this statistic a tad unreliable.

Graph 11, digestive system: Mexico and Guatemala has consistently shown a higher proportion of deaths due to this cause. Egypt and Eastern Bloc countries have also been especially vulnerable in recent years.

Graph 10, respiratory system: Guatemala is exceptionally high here. Most of South America as well as Kazakhstan have seen increases in this cause in the past few years.

Now we shall plot the leading cause of death globally over time. This will be similar to the choropleth plotted above but this time the data is discrete rather than continous. We have to first make a DataFrame of leading cause of death and number of deaths (for easy reference) per year by country.

Note from the future Steve:

Plotly is open-source so expect bugs. The following chunk of code creates placeholder datapoints. Without them, the only shown leading causes of deaths would the ones during the first year (1979). In the following years, if another leading cause of death appears, the choropleth map will simply not show it. This persists until the original leading causes are no longer present in the current year.

This is clearly a bug on their part, so I will take the liberty of using some comparatively inefficient for loops to sort out the issue.

In [32]:
leadingCauses = list(
    pd.DataFrame(
        forMap.replace(np.nan, 0)
        .groupby(["Country", "Year"])
        .apply(
            lambda x: x.rename(
                {"Country": "ColumnCountry", "Year": "ColumnYear"}, axis=1
            ).loc[x.DeathsPer100k.idxmax()]
        )
    ).Cause.unique()
)
vat = {
    "CountryCode": [],
    "Country": [],
    "Cause": [],
    "CauseName": [],
    "Year": [],
    "DeathsPer100k": [],
}
for x in range(1979, 2017):
    for y in leadingCauses:
        vat["CountryCode"].append("ph" + str(y))
        vat["Country"].append("ph" + str(y))
        vat["Cause"].append(y)
        vat["CauseName"].append(disease_codes.loc[y, "name"])
        vat["Year"].append(str(x))
        vat["DeathsPer100k"].append(0)
vatdf = pd.DataFrame(vat)
forDMap = pd.concat([forMap, vatdf], ignore_index=True)
In [33]:
display(forDMap)
maxCauseOfDeath = pd.DataFrame(
    forDMap.replace(np.nan, 0)
    .groupby(["Country", "Year"])
    .apply(
        lambda x: x.rename(
            {"Country": "ColumnCountry", "Year": "ColumnYear"}, axis=1
        ).loc[x.DeathsPer100k.idxmax()]
    )
)
CountryCode Country Cause CauseName Year DeathsPer100k
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN
1 ALB Albania 2 Neoplasms 1979 NaN
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN
3 ALB Albania 6 Diseases of the nervous system 1979 NaN
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN
... ... ... ... ... ... ...
67863 ph1 ph1 1 Certain infectious and parasitic diseases 2016 0.0
67864 ph9 ph9 9 Diseases of the circulatory system 2016 0.0
67865 ph2 ph2 2 Neoplasms 2016 0.0
67866 ph10 ph10 10 Diseases of the respiratory system 2016 0.0
67867 ph7 ph7 7 Diseases of the eye and adnexa 2016 0.0

67868 rows × 6 columns

Now, we can plot a discrete choropleth map.

In [34]:
maxCauseOfDeath.ColumnYear = maxCauseOfDeath.ColumnYear.astype(int)
maxCauseOfDeath.Cause = maxCauseOfDeath.Cause.astype(str)
fig = px.choropleth(
    maxCauseOfDeath,
    color="CauseName",
    locations="CountryCode",
    hover_data=["ColumnCountry", "DeathsPer100k"],
    animation_frame="ColumnYear",
)
fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.9))
fig.write_html("q1discretechoro.html")

With the exception of Guatemala, which has consistently had respiratory illnesses as it's primary cause of death, the rest of the nations experience the causes of death roughly in this order as they become more developed: infectious, circulatory and then neoplasms.

I believe the reasons are as follows. Infectious diseases used to be commonplace as the healthcare infrastructure of many Second and Third World countries were unsatisfactory. Once this problem has been resolved, the two remaining are both degenerative diseases. Circulatory diseases are partially due to lifestyle decisions such as diet and exercise, and so as the citizens of very developed countries educate themselves in these regards, proportion of deaths due to circulatory diseases have decreased. Lastly, there is still no cure for cancer, so neoplasms is the final major cause of death that even exceptionally developed countries are struggling with today.

Let us also plot some boxplots to identify the outlier countries of each cause of death each year.

In [35]:
topCauses = (
    mdf.groupby("Cause").sum().sum(axis=1).sort_values(ascending=False).index[:5]
)
forBoxplot = longmdf.copy()
forBoxplot.reset_index(inplace=True, drop=True)
forBoxplot.DeathsPer100k = forBoxplot.DeathsPer100k.astype(float)

# This is similar to the calibrated data used to create choropleth maps except there is no interpolation.
totalDeathsSeries = forBoxplot.groupby(["CountryCode", "Year"]).DeathsPer100k.sum()
display(forBoxplot)
forBoxplotCalibed = forBoxplot.apply(
    calculate_death_prop, axis=1
)  # re-calibrated DataFrame for Boxplot creation
forBoxplotCalibed.rename(columns={"DeathsPer100k": "DeathsProp"}, inplace=True)
forBoxplotCalibed.DeathsProp /= (
    forBoxplotCalibed.DeathsProp.max() - forBoxplotCalibed.DeathsProp.min()
)
display(forBoxplotCalibed)
CountryCode Country Cause CauseName Year DeathsPer100k
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN
1 ALB Albania 2 Neoplasms 1979 NaN
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN
3 ALB Albania 6 Diseases of the nervous system 1979 NaN
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN
... ... ... ... ... ... ...
67673 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue 2016 NaN
67674 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... 2016 NaN
67675 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system 2016 NaN
67676 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium 2016 NaN
67677 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... 2016 NaN

67678 rows × 6 columns

CountryCode Country Cause CauseName Year DeathsProp
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN
1 ALB Albania 2 Neoplasms 1979 NaN
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN
3 ALB Albania 6 Diseases of the nervous system 1979 NaN
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN
... ... ... ... ... ... ...
67673 VIR Virgin Islands, U.S. 12 Diseases of the skin and subcutaneous tissue 2016 NaN
67674 VIR Virgin Islands, U.S. 13 Diseases of the musculoskeletal system and con... 2016 NaN
67675 VIR Virgin Islands, U.S. 14 Diseases of the genitourinary system 2016 NaN
67676 VIR Virgin Islands, U.S. 15 Pregnancy, childbirth and the puerperium 2016 NaN
67677 VIR Virgin Islands, U.S. 16 Congenital malformations, deformations and chr... 2016 NaN

67678 rows × 6 columns

Now, onto plotting the boxplots. I will be using a whisker size of 2 IQR instead of the conventional 1.5 IQR because this dataset has high kurtosis, and too many outliers makes the plot very cluttered. I will also be labelling the outliers so as to get a clear indication of which nations have been outlying throughout the years.

In [36]:
fig, axs = plt.subplots(len(topCauses), figsize=(20, 100))
forBoxplotCondensed = forBoxplotCalibed[forBoxplotCalibed.Cause.isin(topCauses)]
for x in range(len(topCauses)):
    # Label outliers:
    # dptc stands for DeathsPerTopCause
    dptc = forBoxplotCondensed[forBoxplotCondensed.Cause == topCauses[x]]
    q1 = dptc.groupby(dptc.Year).quantile(0.25)["DeathsProp"].to_numpy()
    q3 = dptc.groupby(dptc.Year).quantile(0.75)["DeathsProp"].to_numpy()
    outlier_top_lim = q3 + 2 * (q3 - q1)
    outlier_bottom_lim = q1 - 2 * (q3 - q1)
    #     print(outlier_top_lim)
    for row in dptc.itertuples():
        year = int(row.Year) - 1979
        #         print(type(year), year)
        val = row.DeathsProp
        if val > outlier_top_lim[year] or val < outlier_bottom_lim[year]:
            axs[x].text(year + 0.1, val, row.CountryCode, ha="left", va="center")

    sns.boxplot(data=dptc, y="DeathsProp", x="Year", ax=axs[x], whis=2)
    axs[x].set_title(disease_codes.loc[topCauses[x], "name"])
    axs[x].set_ylabel("Fraction of Deaths /1")

Circulatory: Guatemala had notably lower proportion of deaths due to this cause prior to 1991.

Neoplasms: Quite a large IQR with few outliers, indicating that no matter how developed a country is, cancer will still take away roughly the same proportion of lives. However, it has been increasing in recent years, as evidenced by the choropleths as well.

Respiratory: Before 2006, Guatemala had exceptionally high proportion of deaths due to this cause. After 2006, Singapore took its position and has maintained this outlier status for the past 10 years. I think this may be due to the high number of smokers after Singapore's GDP per capita rose and people could afford cigarettes.

Digestive: Egypt, North Macedonia and Mexico have all had high proportions of deaths due to this cause, with the latter 2's outlier status persevering from the 1980s until now. Kiribati also experienced a spike in this statistic in the 1990s. Perhaps it is due to the ulcer-inducing spicy food found in the borders of these countries.

Infectious and Parasitic: Guatemala and Kiribati have had notably higher proportions of death due to this cause from the 1980s to the 1990s. In recent years, Thailand and South Africa have dominated the charts, both being significantly outside of the range of 2 IQR. These countries have quite a bit of inequality (e.g. South Africa had apartheid) when it comes to urbanisation, so that may be why.

As BMI increases, the death rate for most causes of death decreases. This is probably because BMI only increases when a country becomes more developed and people can eat more nutritious meals.

The only exceptions are nervous system and behavioural disorders, which are positively correlated with BMI. I will give further explanation below.

On a country level, Russia appears to be an exception as it's number of deaths due to digestive system illnesses is also positively correlated with BMI.

Correlation and Regplots
Let us find out which causes of death are most correlated with BMI through correlation values and regplots.

In [37]:
bmiMort = pd.merge(
    longmdf, longbmidf, on=["CountryCode", "Country", "Year"], how="inner"
)
_bmiMort = pd.merge(longmdf, longbmidf, on=["CountryCode", "Year"], how="inner")
bmiMort.DeathsPer100k = bmiMort.DeathsPer100k.astype(float)
display(bmiMort)
display(_bmiMort)
CountryCode Country Cause CauseName Year DeathsPer100k BMI
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN 24.0
1 ALB Albania 2 Neoplasms 1979 NaN 24.0
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN 24.0
3 ALB Albania 6 Diseases of the nervous system 1979 NaN 24.0
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN 24.0
... ... ... ... ... ... ... ...
60985 VEN Venezuela, Bolivarian Republic of 12 Diseases of the skin and subcutaneous tissue 2016 NaN 26.7
60986 VEN Venezuela, Bolivarian Republic of 13 Diseases of the musculoskeletal system and con... 2016 NaN 26.7
60987 VEN Venezuela, Bolivarian Republic of 14 Diseases of the genitourinary system 2016 NaN 26.7
60988 VEN Venezuela, Bolivarian Republic of 15 Pregnancy, childbirth and the puerperium 2016 NaN 26.7
60989 VEN Venezuela, Bolivarian Republic of 16 Congenital malformations, deformations and chr... 2016 NaN 26.7

60990 rows × 7 columns

CountryCode Country_x Cause CauseName Year DeathsPer100k Country_y BMI
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN Albania 24.0
1 ALB Albania 2 Neoplasms 1979 NaN Albania 24.0
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN Albania 24.0
3 ALB Albania 6 Diseases of the nervous system 1979 NaN Albania 24.0
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN Albania 24.0
... ... ... ... ... ... ... ... ...
60985 VEN Venezuela, Bolivarian Republic of 12 Diseases of the skin and subcutaneous tissue 2016 NaN Venezuela, Bolivarian Republic of 26.7
60986 VEN Venezuela, Bolivarian Republic of 13 Diseases of the musculoskeletal system and con... 2016 NaN Venezuela, Bolivarian Republic of 26.7
60987 VEN Venezuela, Bolivarian Republic of 14 Diseases of the genitourinary system 2016 NaN Venezuela, Bolivarian Republic of 26.7
60988 VEN Venezuela, Bolivarian Republic of 15 Pregnancy, childbirth and the puerperium 2016 NaN Venezuela, Bolivarian Republic of 26.7
60989 VEN Venezuela, Bolivarian Republic of 16 Congenital malformations, deformations and chr... 2016 NaN Venezuela, Bolivarian Republic of 26.7

60990 rows × 8 columns

Notice how the number of rows is the same whether Country is include in the on parameter. This shows that the mapping between Country and CountryCode is 1-to-1.

We shall now proceed to plot regplots of every countries data throughout the years. We will be removing the outliers (defined arbitrarily based on quantile). If you decrease the quantile range, you would see that the number of data points remaining is decreased, which shows that the code is correct.

In [38]:
from scipy import stats
import math


def find_correl(df, a, b):
    dfnona = df.dropna()
    return stats.pearsonr(dfnona[a], dfnona[b])


# append correlation between two variables grouped by Cause
def append_correl(df, a1, b1):
    correls = df.groupby("Cause").apply(find_correl, a=a1, b=b1)
    diseases = df.Cause.unique()
    correldf = pd.DataFrame(
        [[a, correls[a][0], correls[a][1]] for a in diseases],
        columns=["Cause", "Correl", "pValue"],
    )
    return pd.merge(df, correldf, on="Cause")


def remove_outliers(df):
    outlier_boundaries = (
        df.groupby("Cause").DeathsPer100k.quantile([0.05, 0.95]).unstack(level=1)
    )
    nooutlier = df[
        (
            df.DeathsPer100k
            >= outlier_boundaries.iloc[df.Cause - 1, 0].reset_index(drop=True)
        )
        & (
            df.DeathsPer100k
            <= outlier_boundaries.iloc[df.Cause - 1, 1].reset_index(drop=True)
        )
    ]
    return nooutlier


# remove outliers + append_correl
display(bmiMort)
bmiMort_nooutlier = remove_outliers(bmiMort)
display(bmiMort_nooutlier)
bmiMort2 = append_correl(bmiMort_nooutlier.dropna(), "BMI", "DeathsPer100k")
display(bmiMort2)


def plot4x4reg(df, indepvar):
    # To make it work for other number of causes as well
    count = len(disease_codes)
    maxwidth = math.ceil(count**0.5)
    fig, axes = plt.subplots(maxwidth, maxwidth, figsize=(30, 20))
    diseases_sorted = df.sort_values(by="Correl").CauseName.unique()
    for ax, cause in zip(axes.ravel(), diseases_sorted):
        dfs = df.query("CauseName==@cause")
        sns.regplot(
            data=dfs,
            x=indepvar,
            y="DeathsPer100k",
            scatter_kws={"s": 3},
            x_jitter=0.05,
            ax=ax,
        ).set_title(
            cause + " (" + str(round(dfs.Correl.iloc[0], 2)) + ", p="
            # round p value to 3 d.p as anything less than 0.001 is statistically significant
            + str(round(dfs.pValue.iloc[0], 4)) + ")",
            fontsize=8,
        )


plot4x4reg(
    append_correl(bmiMort.dropna(), "BMI", "DeathsPer100k"), "BMI"
)  # with outliers
CountryCode Country Cause CauseName Year DeathsPer100k BMI
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN 24.0
1 ALB Albania 2 Neoplasms 1979 NaN 24.0
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN 24.0
3 ALB Albania 6 Diseases of the nervous system 1979 NaN 24.0
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN 24.0
... ... ... ... ... ... ... ...
60985 VEN Venezuela, Bolivarian Republic of 12 Diseases of the skin and subcutaneous tissue 2016 NaN 26.7
60986 VEN Venezuela, Bolivarian Republic of 13 Diseases of the musculoskeletal system and con... 2016 NaN 26.7
60987 VEN Venezuela, Bolivarian Republic of 14 Diseases of the genitourinary system 2016 NaN 26.7
60988 VEN Venezuela, Bolivarian Republic of 15 Pregnancy, childbirth and the puerperium 2016 NaN 26.7
60989 VEN Venezuela, Bolivarian Republic of 16 Congenital malformations, deformations and chr... 2016 NaN 26.7

60990 rows × 7 columns

CountryCode Country Cause CauseName Year DeathsPer100k BMI
30 ARG Argentina 1 Certain infectious and parasitic diseases 1979 29.8 24.4
31 ARG Argentina 2 Neoplasms 1979 155.2 24.4
34 ARG Argentina 5 Mental and behavioural disorders 1979 3.6 24.4
35 ARG Argentina 6 Diseases of the nervous system 1979 9.9 24.4
36 ARG Argentina 7 Diseases of the eye and adnexa 1979 0.0 24.4
... ... ... ... ... ... ... ...
60889 USA United States 12 Diseases of the skin and subcutaneous tissue 2016 0.8 28.9
60890 USA United States 13 Diseases of the musculoskeletal system and con... 2016 2.4 28.9
60891 USA United States 14 Diseases of the genitourinary system 2016 11.1 28.9
60892 USA United States 15 Pregnancy, childbirth and the puerperium 2016 0.8 28.9
60893 USA United States 16 Congenital malformations, deformations and chr... 2016 3.7 28.9

37897 rows × 7 columns

CountryCode Country Cause CauseName Year DeathsPer100k BMI Correl pValue
0 ARG Argentina 1 Certain infectious and parasitic diseases 1979 29.8 24.4 -0.127130 1.409419e-10
1 AUS Australia 1 Certain infectious and parasitic diseases 1979 3.4 24.2 -0.127130 1.409419e-10
2 BRB Barbados 1 Certain infectious and parasitic diseases 1979 17.8 25.3 -0.127130 1.409419e-10
3 BEL Belgium 1 Certain infectious and parasitic diseases 1979 5.4 24.9 -0.127130 1.409419e-10
4 CUB Cuba 1 Certain infectious and parasitic diseases 1979 11.9 23.0 -0.127130 1.409419e-10
... ... ... ... ... ... ... ... ... ...
37892 MDA Moldova, Republic of 4 Endocrine, nutritional and metabolic diseases 2016 8.7 27.0 0.217845 1.302295e-15
37893 ROU Romania 4 Endocrine, nutritional and metabolic diseases 2016 7.3 26.9 0.217845 1.302295e-15
37894 SWE Sweden 4 Endocrine, nutritional and metabolic diseases 2016 10.5 26.0 0.217845 1.302295e-15
37895 THA Thailand 4 Endocrine, nutritional and metabolic diseases 2016 18.9 24.4 0.217845 1.302295e-15
37896 USA United States 4 Endocrine, nutritional and metabolic diseases 2016 21.5 28.9 0.217845 1.302295e-15

37897 rows × 9 columns

Now, without outliers the regplots are much tighter along the y-axis.

In [39]:
plot4x4reg(bmiMort2, "BMI")

As one can see, there is a huge amount of variability in the data (barring the jitter set in place to compensate for decimal point issues) so there is no strong correlation between death originating from major causes and BMI. However, what if we take the average of DeathsPer100k for every country per year. This would theoretically fix the massive amount of variation between the developed and developing countries at any point in time.

In [40]:
def get_mean_deaths(df, indepvar):
    meanDeaths = df.groupby(["Year", "Cause", "CauseName"]).agg(
        {"DeathsPer100k": ["mean"], indepvar: ["mean"]}
    )
    meanDeaths.reset_index(inplace=True)
    meanDeaths.columns = meanDeaths.columns.droplevel(1)
    return meanDeaths


def reg_mean_deaths(df, indepvar):
    meanDeaths = get_mean_deaths(df, indepvar)
    meanDeaths = append_correl(meanDeaths, indepvar, "DeathsPer100k")
    display(meanDeaths)

    sns.color_palette("bright")
    sns.lmplot(
        data=meanDeaths,
        x=indepvar,
        y="DeathsPer100k",
        hue="CauseName",
        scatter_kws={"s": 5},
        height=8,
        aspect=1.5,
        legend=False,
    )
    legend = (
        meanDeaths.CauseName
        + " ("
        + meanDeaths.Correl.round(2).astype(str)
        + ", p="
        + meanDeaths.pValue.round(4).astype(str)
        + ")"
    ).unique()
    plt.legend(title="Cause", bbox_to_anchor=(1.04, 1), loc="upper left", labels=legend)

    plot4x4reg(meanDeaths, indepvar)


display(bmiMort2)
reg_mean_deaths(bmiMort2, "BMI")
CountryCode Country Cause CauseName Year DeathsPer100k BMI Correl pValue
0 ARG Argentina 1 Certain infectious and parasitic diseases 1979 29.8 24.4 -0.127130 1.409419e-10
1 AUS Australia 1 Certain infectious and parasitic diseases 1979 3.4 24.2 -0.127130 1.409419e-10
2 BRB Barbados 1 Certain infectious and parasitic diseases 1979 17.8 25.3 -0.127130 1.409419e-10
3 BEL Belgium 1 Certain infectious and parasitic diseases 1979 5.4 24.9 -0.127130 1.409419e-10
4 CUB Cuba 1 Certain infectious and parasitic diseases 1979 11.9 23.0 -0.127130 1.409419e-10
... ... ... ... ... ... ... ... ... ...
37892 MDA Moldova, Republic of 4 Endocrine, nutritional and metabolic diseases 2016 8.7 27.0 0.217845 1.302295e-15
37893 ROU Romania 4 Endocrine, nutritional and metabolic diseases 2016 7.3 26.9 0.217845 1.302295e-15
37894 SWE Sweden 4 Endocrine, nutritional and metabolic diseases 2016 10.5 26.0 0.217845 1.302295e-15
37895 THA Thailand 4 Endocrine, nutritional and metabolic diseases 2016 18.9 24.4 0.217845 1.302295e-15
37896 USA United States 4 Endocrine, nutritional and metabolic diseases 2016 21.5 28.9 0.217845 1.302295e-15

37897 rows × 9 columns

Year Cause CauseName DeathsPer100k BMI Correl pValue
0 1979 1 Certain infectious and parasitic diseases 20.114815 24.118519 -0.763811 2.425297e-08
1 1980 1 Certain infectious and parasitic diseases 21.067442 24.151163 -0.763811 2.425297e-08
2 1981 1 Certain infectious and parasitic diseases 21.580000 24.167273 -0.763811 2.425297e-08
3 1982 1 Certain infectious and parasitic diseases 20.972222 24.283333 -0.763811 2.425297e-08
4 1983 1 Certain infectious and parasitic diseases 18.088372 24.306977 -0.763811 2.425297e-08
... ... ... ... ... ... ... ...
573 2012 4 Endocrine, nutritional and metabolic diseases 27.324324 26.578378 0.309181 1.511319e-01
574 2013 4 Endocrine, nutritional and metabolic diseases 26.913889 26.665278 0.309181 1.511319e-01
575 2014 4 Endocrine, nutritional and metabolic diseases 26.654286 26.652857 0.309181 1.511319e-01
576 2015 4 Endocrine, nutritional and metabolic diseases 23.430189 26.664151 0.309181 1.511319e-01
577 2016 4 Endocrine, nutritional and metabolic diseases 16.120000 26.726667 0.309181 1.511319e-01

578 rows × 7 columns

As one can see, the gradient of the regression lines (i.e. whether they are increasing or decreasing) are in line with the correlation values in the legend. This shows that this plot is correct.

Analysis: One would expect that as BMI increases, deaths due to cardiovascular reasons would increase. However this is not the case. There is a very strong negative correlation (c=-0.93) between BMI and circulatory system diseases. The same is observed for many other causes of death. I think this is because a country with high BMI is a country that can cater to the nutritional needs of its citizens very well, so it is developed and healthcare standards should be high as well, and deaths due to various illnesses decrease. The only cause that shows a strong positive correlation with BMI is mental and behavioural disorders. Perhaps this is due to the same reason as the choropleth maps above. I.e., in developed countries, people with mental illnesses actually survive long enough to be considered a valid data point.

Let us now try to narrow it down to any particular country. We will keep the outliers in for this one as while the deaths figures may be outliers on the global scale, on a country-by-country basis they could still be statistically significant. NOTE: I will only be drawing conclusions from "important" countries such as the United States, but this model can be used for any country with data.

In [144]:
def reg_per_country(df, indepvar):
    inp = ""
    while inp.lower() != "stop":
        inp = input("Enter alpha-3 country code: ").upper()
        if inp in df.CountryCode.values:
            country = df[df.CountryCode == inp]
            print(country.Country.iloc[0])

            # Plot only causes that have a variety of number of deaths.
            gbCause = country.groupby("Cause")
            largeDiff = gbCause.DeathsPer100k.max() - gbCause.DeathsPer100k.min() > 10
            country = country[largeDiff[country.Cause].values]
            country.drop(
                ["Correl", "pValue"], axis=1, inplace=True
            )  # recalculate correlation

            country = append_correl(country, indepvar, "DeathsPer100k")
            toPlot = country[
                # Ensure correlation is strong and valid
                (abs(country.Correl) >= 0.5)
                & (country.pValue <= 0.05)
            ]

            # Sort by sum of deaths to make the line colors a bit more in sync for easier cross-referencing.
            longmdf.DeathsPer100k = longmdf.DeathsPer100k.astype(float)
            sort_by_this = (
                longmdf.groupby("Cause").DeathsPer100k.sum().sort_values().index
            )
            toPlot.Cause = toPlot.Cause.astype("category")
            toPlot.Cause.cat.set_categories(sort_by_this, inplace=True)
            toPlot.sort_values(["Cause"], inplace=True, ascending=False)
            display(toPlot)

            plot = sns.lmplot(
                data=toPlot,
                x=indepvar,
                y="DeathsPer100k",
                hue="CauseName",
                scatter_kws={"s": 10},
                height=10,
                aspect=1.5,
                legend=False,
            )
            legend = (
                toPlot.CauseName
                + " ("
                + toPlot.Correl.round(2).astype(str)
                + ", p="
                + toPlot.pValue.round(4).astype(str)
                + ")"
            ).unique()
            plt.legend(title="Cause", loc="upper right", labels=legend)
            plt.show(plot)


reg_per_country(append_correl(bmiMort, "BMI", "DeathsPer100k").dropna(), "BMI")
United States
CountryCode Country Cause CauseName Year DeathsPer100k BMI Correl pValue
189 USA United States 9 Diseases of the circulatory system 2016 131.8 28.9 -0.993215 3.041203e-35
161 USA United States 9 Diseases of the circulatory system 1988 266.8 26.0 -0.993215 3.041203e-35
168 USA United States 9 Diseases of the circulatory system 1995 226.7 26.9 -0.993215 3.041203e-35
167 USA United States 9 Diseases of the circulatory system 1994 228.8 26.7 -0.993215 3.041203e-35
166 USA United States 9 Diseases of the circulatory system 1993 235.2 26.6 -0.993215 3.041203e-35
... ... ... ... ... ... ... ... ... ...
103 USA United States 5 Mental and behavioural disorders 2006 14.7 28.2 0.921908 2.129032e-16
102 USA United States 5 Mental and behavioural disorders 2005 12.1 28.1 0.921908 2.129032e-16
101 USA United States 5 Mental and behavioural disorders 2004 10.9 28.0 0.921908 2.129032e-16
100 USA United States 5 Mental and behavioural disorders 2003 10.8 27.9 0.921908 2.129032e-16
78 USA United States 5 Mental and behavioural disorders 1981 5.2 25.3 0.921908 2.129032e-16

152 rows × 9 columns

Russian Federation
CountryCode Country Cause CauseName Year DeathsPer100k BMI Correl pValue
54 RUS Russian Federation 2 Neoplasms 1998 151.7 25.7 -0.768837 4.329938e-08
44 RUS Russian Federation 2 Neoplasms 1988 156.0 25.5 -0.768837 4.329938e-08
50 RUS Russian Federation 2 Neoplasms 1994 161.8 25.6 -0.768837 4.329938e-08
49 RUS Russian Federation 2 Neoplasms 1993 162.9 25.6 -0.768837 4.329938e-08
48 RUS Russian Federation 2 Neoplasms 1992 160.9 25.6 -0.768837 4.329938e-08
... ... ... ... ... ... ... ... ... ...
83 RUS Russian Federation 6 Diseases of the nervous system 2010 10.2 26.1 0.538560 2.571789e-02
84 RUS Russian Federation 6 Diseases of the nervous system 2011 10.1 26.1 0.538560 2.571789e-02
85 RUS Russian Federation 6 Diseases of the nervous system 2012 9.7 26.1 0.538560 2.571789e-02
86 RUS Russian Federation 6 Diseases of the nervous system 2013 11.3 26.2 0.538560 2.571789e-02
80 RUS Russian Federation 6 Diseases of the nervous system 2007 8.6 26.0 0.538560 2.571789e-02

161 rows × 9 columns

United Kingdom
CountryCode Country Cause CauseName Year DeathsPer100k BMI Correl pValue
112 GBR United Kingdom 9 Diseases of the circulatory system 1980 361.3 24.1 -0.992869 6.316933e-34
138 GBR United Kingdom 9 Diseases of the circulatory system 2006 146.5 26.8 -0.992869 6.316933e-34
131 GBR United Kingdom 9 Diseases of the circulatory system 1999 205.2 26.2 -0.992869 6.316933e-34
132 GBR United Kingdom 9 Diseases of the circulatory system 2000 191.3 26.3 -0.992869 6.316933e-34
133 GBR United Kingdom 9 Diseases of the circulatory system 2001 191.5 26.4 -0.992869 6.316933e-34
... ... ... ... ... ... ... ... ... ...
60 GBR United Kingdom 5 Mental and behavioural disorders 2002 13.0 26.5 0.730282 2.894156e-07
59 GBR United Kingdom 5 Mental and behavioural disorders 2001 13.0 26.4 0.730282 2.894156e-07
58 GBR United Kingdom 5 Mental and behavioural disorders 2000 10.6 26.3 0.730282 2.894156e-07
57 GBR United Kingdom 5 Mental and behavioural disorders 1999 10.7 26.2 0.730282 2.894156e-07
72 GBR United Kingdom 5 Mental and behavioural disorders 2014 23.4 27.1 0.730282 2.894156e-07

185 rows × 9 columns

Brazil
CountryCode Country Cause CauseName Year DeathsPer100k BMI Correl pValue
110 BRA Brazil 9 Diseases of the circulatory system 2015 167.5 26.5 -0.900032 3.513113e-14
82 BRA Brazil 9 Diseases of the circulatory system 1987 237.9 23.7 -0.900032 3.513113e-14
90 BRA Brazil 9 Diseases of the circulatory system 1995 285.7 24.4 -0.900032 3.513113e-14
89 BRA Brazil 9 Diseases of the circulatory system 1994 292.0 24.3 -0.900032 3.513113e-14
88 BRA Brazil 9 Diseases of the circulatory system 1993 299.4 24.2 -0.900032 3.513113e-14
... ... ... ... ... ... ... ... ... ...
24 BRA Brazil 1 Certain infectious and parasitic diseases 2003 31.5 25.1 -0.862668 6.689014e-12
23 BRA Brazil 1 Certain infectious and parasitic diseases 2002 31.1 25.0 -0.862668 6.689014e-12
22 BRA Brazil 1 Certain infectious and parasitic diseases 2001 31.7 24.9 -0.862668 6.689014e-12
21 BRA Brazil 1 Certain infectious and parasitic diseases 2000 32.0 24.8 -0.862668 6.689014e-12
0 BRA Brazil 1 Certain infectious and parasitic diseases 1979 62.6 23.0 -0.862668 6.689014e-12

111 rows × 9 columns

United States: Nervous system illnesses and mental disorders are related, and they are both positively correlated with BMI. The only explanation I can think of is the one previously mentioned. U.S.'s data is in accordance with the rest of the world.

Russia: The main thing to note here is the strong positive correlation between digestive system diseases and BMI. Perhaps the Russians' diet is quite unhealthy, so as they consume more nutrition, their risk of digestive system illness also increase.

United Kingdom: Similar to U.S., except there is also a negative correlation bewtween respiratory system illnesses and BMI. Perhaps as the U.K. became more developed, people gained the awareness to smoke less and environmental standards went up as well, reducing the number of deaths due to this cause.

Brazil: Nothing out of the ordinary. Developing countries are not that different from developed countries in terms of this correlation.

3. How does the average Child Mortality Rate (CMR) correlate with the causes of mortality at any point in time?

As CMR increases, the death rate for most causes of death increases. This is probably because CMR is only high when a country has poor health infrastructure.

The only exceptions are nervous system and behavioural disorders, which are negatively correlated with CMR.

On a country level, Russia appears to be an exception as it's number of deaths due to digestive system illnesses is also negatively correlated with CMR.

In [42]:
cmrMort = pd.merge(
    longmdf, longcmrdf, on=["CountryCode", "Country", "Year"], how="inner"
)
_cmrMort = pd.merge(longmdf, longcmrdf, on=["CountryCode", "Year"], how="inner")
cmrMort.DeathsPer100k = cmrMort.DeathsPer100k.astype(float)
display(cmrMort)
display(_cmrMort)
CountryCode Country Cause CauseName Year DeathsPer100k CMR
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN 71.181438
1 ALB Albania 2 Neoplasms 1979 NaN 71.181438
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN 71.181438
3 ALB Albania 6 Diseases of the nervous system 1979 NaN 71.181438
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN 71.181438
... ... ... ... ... ... ... ...
61593 VEN Venezuela, Bolivarian Republic of 12 Diseases of the skin and subcutaneous tissue 2016 NaN 21.044404
61594 VEN Venezuela, Bolivarian Republic of 13 Diseases of the musculoskeletal system and con... 2016 NaN 21.044404
61595 VEN Venezuela, Bolivarian Republic of 14 Diseases of the genitourinary system 2016 NaN 21.044404
61596 VEN Venezuela, Bolivarian Republic of 15 Pregnancy, childbirth and the puerperium 2016 NaN 21.044404
61597 VEN Venezuela, Bolivarian Republic of 16 Congenital malformations, deformations and chr... 2016 NaN 21.044404

61598 rows × 7 columns

CountryCode Country_x Cause CauseName Year DeathsPer100k Country_y CMR
0 ALB Albania 1 Certain infectious and parasitic diseases 1979 NaN Albania 71.181438
1 ALB Albania 2 Neoplasms 1979 NaN Albania 71.181438
2 ALB Albania 5 Mental and behavioural disorders 1979 NaN Albania 71.181438
3 ALB Albania 6 Diseases of the nervous system 1979 NaN Albania 71.181438
4 ALB Albania 7 Diseases of the eye and adnexa 1979 NaN Albania 71.181438
... ... ... ... ... ... ... ... ...
61593 VEN Venezuela, Bolivarian Republic of 12 Diseases of the skin and subcutaneous tissue 2016 NaN Venezuela, Bolivarian Republic of 21.044404
61594 VEN Venezuela, Bolivarian Republic of 13 Diseases of the musculoskeletal system and con... 2016 NaN Venezuela, Bolivarian Republic of 21.044404
61595 VEN Venezuela, Bolivarian Republic of 14 Diseases of the genitourinary system 2016 NaN Venezuela, Bolivarian Republic of 21.044404
61596 VEN Venezuela, Bolivarian Republic of 15 Pregnancy, childbirth and the puerperium 2016 NaN Venezuela, Bolivarian Republic of 21.044404
61597 VEN Venezuela, Bolivarian Republic of 16 Congenital malformations, deformations and chr... 2016 NaN Venezuela, Bolivarian Republic of 21.044404

61598 rows × 8 columns

As per BMI, cmrdf's Country column is also in accordance with that of mdf. Now, let's plot some regplots but without outliers.

In [43]:
cmrMort2 = append_correl(remove_outliers(cmrMort).dropna(), "CMR", "DeathsPer100k")
display(cmrMort2)
plot4x4reg(cmrMort2, "CMR")
CountryCode Country Cause CauseName Year DeathsPer100k CMR Correl pValue
0 ARG Argentina 1 Certain infectious and parasitic diseases 1979 29.8 40.094603 0.620171 2.058840e-268
1 AUS Australia 1 Certain infectious and parasitic diseases 1979 3.4 11.297138 0.620171 2.058840e-268
2 BRB Barbados 1 Certain infectious and parasitic diseases 1979 17.8 24.888679 0.620171 2.058840e-268
3 BEL Belgium 1 Certain infectious and parasitic diseases 1979 5.4 12.876635 0.620171 2.058840e-268
4 CUB Cuba 1 Certain infectious and parasitic diseases 1979 11.9 19.279995 0.620171 2.058840e-268
... ... ... ... ... ... ... ... ... ...
37881 MDA Moldova, Republic of 4 Endocrine, nutritional and metabolic diseases 2016 8.7 13.149784 0.385809 4.247712e-48
37882 ROU Romania 4 Endocrine, nutritional and metabolic diseases 2016 7.3 7.112852 0.385809 4.247712e-48
37883 SWE Sweden 4 Endocrine, nutritional and metabolic diseases 2016 10.5 2.287122 0.385809 4.247712e-48
37884 THA Thailand 4 Endocrine, nutritional and metabolic diseases 2016 18.9 8.854661 0.385809 4.247712e-48
37885 USA United States 4 Endocrine, nutritional and metabolic diseases 2016 21.5 5.764999 0.385809 4.247712e-48

37886 rows × 9 columns

Same as BMI, correlation on a global scale has too much variance to be statistically significant. The cause with the highest correlation is related to childbirth, which makes sense as a high number of the deaths occurring due to childbirth can be classified as "child mortality".

In [44]:
reg_mean_deaths(cmrMort2, "CMR")
Year Cause CauseName DeathsPer100k CMR Correl pValue
0 1979 1 Certain infectious and parasitic diseases 19.450000 21.749274 0.755045 4.318056e-08
1 1980 1 Certain infectious and parasitic diseases 20.597619 25.321455 0.755045 4.318056e-08
2 1981 1 Certain infectious and parasitic diseases 20.913208 27.414685 0.755045 4.318056e-08
3 1982 1 Certain infectious and parasitic diseases 20.358491 27.987798 0.755045 4.318056e-08
4 1983 1 Certain infectious and parasitic diseases 18.088372 20.594925 0.755045 4.318056e-08
... ... ... ... ... ... ... ...
573 2012 4 Endocrine, nutritional and metabolic diseases 27.324324 9.620851 0.325408 1.297328e-01
574 2013 4 Endocrine, nutritional and metabolic diseases 26.827397 9.601338 0.325408 1.297328e-01
575 2014 4 Endocrine, nutritional and metabolic diseases 26.581690 8.949516 0.325408 1.297328e-01
576 2015 4 Endocrine, nutritional and metabolic diseases 23.637037 8.447201 0.325408 1.297328e-01
577 2016 4 Endocrine, nutritional and metabolic diseases 16.120000 6.185677 0.325408 1.297328e-01

578 rows × 7 columns

The correlations are almost the exact opposite of that of BMI. Diseases of the nervous system and mental disorders are positively correlated with CMR, once again proving my hypothesis that many people with undiagnosed mental disorders are dying young, so they will not be tallied as a death due to those causes.

In [145]:
reg_per_country(append_correl(cmrMort, "CMR", "DeathsPer100k").dropna(), "CMR")
United States
CountryCode Country Cause CauseName Year DeathsPer100k CMR Correl pValue
189 USA United States 9 Diseases of the circulatory system 2016 131.8 5.764999 0.973829 9.290551e-25
161 USA United States 9 Diseases of the circulatory system 1988 266.8 9.929447 0.973829 9.290551e-25
168 USA United States 9 Diseases of the circulatory system 1995 226.7 7.929688 0.973829 9.290551e-25
167 USA United States 9 Diseases of the circulatory system 1994 228.8 8.191276 0.973829 9.290551e-25
166 USA United States 9 Diseases of the circulatory system 1993 235.2 8.480499 0.973829 9.290551e-25
... ... ... ... ... ... ... ... ... ...
103 USA United States 5 Mental and behavioural disorders 2006 14.7 6.666849 -0.840583 4.040001e-11
102 USA United States 5 Mental and behavioural disorders 2005 12.1 6.745975 -0.840583 4.040001e-11
101 USA United States 5 Mental and behavioural disorders 2004 10.9 6.816018 -0.840583 4.040001e-11
100 USA United States 5 Mental and behavioural disorders 2003 10.8 6.880536 -0.840583 4.040001e-11
78 USA United States 5 Mental and behavioural disorders 1981 5.2 12.102701 -0.840583 4.040001e-11

152 rows × 9 columns

Russian Federation
CountryCode Country Cause CauseName Year DeathsPer100k CMR Correl pValue
36 RUS Russian Federation 2 Neoplasms 1980 144.1 23.160614 0.749570 1.434614e-07
55 RUS Russian Federation 2 Neoplasms 1999 152.5 17.145943 0.749570 1.434614e-07
57 RUS Russian Federation 2 Neoplasms 2001 148.2 15.699803 0.749570 1.434614e-07
58 RUS Russian Federation 2 Neoplasms 2002 146.6 14.787114 0.749570 1.434614e-07
59 RUS Russian Federation 2 Neoplasms 2003 145.0 13.800184 0.749570 1.434614e-07
... ... ... ... ... ... ... ... ... ...
75 RUS Russian Federation 6 Diseases of the nervous system 2002 9.3 14.787114 -0.527442 2.956826e-02
74 RUS Russian Federation 6 Diseases of the nervous system 2001 8.8 15.699803 -0.527442 2.956826e-02
73 RUS Russian Federation 6 Diseases of the nervous system 2000 8.4 16.499216 -0.527442 2.956826e-02
72 RUS Russian Federation 6 Diseases of the nervous system 1999 8.1 17.145943 -0.527442 2.956826e-02
77 RUS Russian Federation 6 Diseases of the nervous system 2004 9.7 12.811291 -0.527442 2.956826e-02

125 rows × 9 columns

Brazil
CountryCode Country Cause CauseName Year DeathsPer100k CMR Correl pValue
110 BRA Brazil 9 Diseases of the circulatory system 2015 167.5 14.046389 0.846563 4.071673e-11
82 BRA Brazil 9 Diseases of the circulatory system 1987 237.9 58.420730 0.846563 4.071673e-11
90 BRA Brazil 9 Diseases of the circulatory system 1995 285.7 40.903623 0.846563 4.071673e-11
89 BRA Brazil 9 Diseases of the circulatory system 1994 292.0 43.342598 0.846563 4.071673e-11
88 BRA Brazil 9 Diseases of the circulatory system 1993 299.4 45.764745 0.846563 4.071673e-11
... ... ... ... ... ... ... ... ... ...
24 BRA Brazil 1 Certain infectious and parasitic diseases 2003 31.5 25.023282 0.906555 1.137526e-14
23 BRA Brazil 1 Certain infectious and parasitic diseases 2002 31.1 26.703441 0.906555 1.137526e-14
22 BRA Brazil 1 Certain infectious and parasitic diseases 2001 31.7 28.459729 0.906555 1.137526e-14
21 BRA Brazil 1 Certain infectious and parasitic diseases 2000 32.0 30.300841 0.906555 1.137526e-14
0 BRA Brazil 1 Certain infectious and parasitic diseases 1979 62.6 79.486312 0.906555 1.137526e-14

111 rows × 9 columns

United Kingdom
CountryCode Country Cause CauseName Year DeathsPer100k CMR Correl pValue
112 GBR United Kingdom 9 Diseases of the circulatory system 1980 361.3 12.068115 0.961258 3.592956e-21
138 GBR United Kingdom 9 Diseases of the circulatory system 2006 146.5 5.037681 0.961258 3.592956e-21
131 GBR United Kingdom 9 Diseases of the circulatory system 1999 205.2 5.659105 0.961258 3.592956e-21
132 GBR United Kingdom 9 Diseases of the circulatory system 2000 191.3 5.569100 0.961258 3.592956e-21
133 GBR United Kingdom 9 Diseases of the circulatory system 2001 191.5 5.483651 0.961258 3.592956e-21
... ... ... ... ... ... ... ... ... ...
60 GBR United Kingdom 5 Mental and behavioural disorders 2002 13.0 5.401056 -0.704435 1.139751e-06
59 GBR United Kingdom 5 Mental and behavioural disorders 2001 13.0 5.483651 -0.704435 1.139751e-06
58 GBR United Kingdom 5 Mental and behavioural disorders 2000 10.6 5.569100 -0.704435 1.139751e-06
57 GBR United Kingdom 5 Mental and behavioural disorders 1999 10.7 5.659105 -0.704435 1.139751e-06
72 GBR United Kingdom 5 Mental and behavioural disorders 2014 23.4 3.892211 -0.704435 1.139751e-06

185 rows × 9 columns

Almost the reverse correlations as BMI. Once again, Russia is the anomaly here, with deaths due to digestive diseases correlating negatively with CMR. I cannot explain this with a direct reason. Perhaps it is due to the same reason as BMI, where as Russia gets more developed, its citizens eat unhealthily enough to develop digestive disorders.

The only cause of death that could be predicted to some extent from these two statistics is illnesses of the circulatory system. The model that has been created works best when the country is in the transition phase between developing and developed.

We will use the data for every developed country (HDI > 0.9) as over the course of 38 years, most of them have went through several stages of development and can be used as a predictor for the rest of the world in the future. This would be free from the variance (perhaps due to war and instabilities) that's present in developing countries. Feel free to adjust the HDI threshold to something other than 0.9, but from my experience doing so, the resulting model is really all over the place. The scatterplots look to be randomly generated in that case. The failed graphs have been omitted for the sake of saliency.

I will not be taking the mean of all countries throughout the years as that produces too few data points and is prone to overfitting.

In [117]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

altogether = pd.merge(
    bmiMort2.drop(['Correl', 'pValue'], axis=1),
    cmrMort2[["CountryCode", "Cause", "Year", "CMR"]],
    on=["CountryCode", "Cause", "Year"],
)

# to show that there is an hdi value for every country code in altogether
# notice number of rows is the same
display(len(altogether))
altogether = pd.merge(altogether, hdidf, on=["CountryCode", "Country"], how="left")
display(len(altogether))

# feel free to uncomment these lines for mean deaths per year,
# but there are too few data points, leading to underfitting
# when it comes to a country by country basis.

# it will not be plotted for the sake of saliency.

# for_mlm = pd.concat([get_mean_deaths(altogether,'BMI'),
#                      get_mean_deaths(altogether,'CMR')['CMR']],
#                     axis=1)
for_mlm = altogether[altogether.HDI > 0.9]
for_mlm
37836
37836
Out[117]:
CountryCode Country Cause CauseName Year DeathsPer100k BMI CMR HDI
1 AUS Australia 1 Certain infectious and parasitic diseases 1979 3.4 24.2 11.297138 0.938379
3 BEL Belgium 1 Certain infectious and parasitic diseases 1979 5.4 24.9 12.876635 0.918841
8 IRL Ireland 1 Certain infectious and parasitic diseases 1979 8.4 24.0 13.043885 0.942473
9 ISR Israel 1 Certain infectious and parasitic diseases 1979 12.1 24.6 16.275730 0.906235
11 JPN Japan 1 Certain infectious and parasitic diseases 1979 10.4 22.2 7.832282 0.914696
... ... ... ... ... ... ... ... ... ...
37822 AUT Austria 4 Endocrine, nutritional and metabolic diseases 2016 19.3 25.6 2.993266 0.913809
37827 ISL Iceland 4 Endocrine, nutritional and metabolic diseases 2016 8.2 26.2 1.707952 0.938474
37829 NLD Netherlands 4 Endocrine, nutritional and metabolic diseases 2016 9.4 25.6 3.428635 0.933495
37833 SWE Sweden 4 Endocrine, nutritional and metabolic diseases 2016 10.5 26.0 2.287122 0.936628
37835 USA United States 4 Endocrine, nutritional and metabolic diseases 2016 21.5 28.9 5.764999 0.919926

9884 rows × 9 columns

With the combined dataset manufactured, we can now proceed to create a multiple linear regression model using BMI and CMR to predict any cause of death we so choose. This data is free from outliers. distplots will be used to determine roughly how good a model they are.

In [118]:
def build_model(cause):
    mlm = LinearRegression()
    by_cause = for_mlm.query("Cause==@cause")
    # normalize
    by_cause.DeathsPer100k /= (
        by_cause.DeathsPer100k.max() - by_cause.DeathsPer100k.min()
    )
    x_train, x_test, y_train, y_test = train_test_split(
        by_cause[["CMR", "BMI"]],
        by_cause["DeathsPer100k"],
        test_size=0.2,
        random_state=4132,
    )
    mlm.fit(x_train, y_train)
    return (mlm.intercept_, mlm.coef_, mlm.predict(x_test), y_test)


fig, axes = plt.subplots(len(disease_codes), 1, figsize=(20, 80))
fig.tight_layout(pad=3.0)
models = []
for cause in range(1, len(disease_codes) + 1):
    model = build_model(cause)
    name = disease_codes.loc[cause, "name"]
    models.append([name, mean_squared_error(model[3], model[2]), model])
    ax1 = sns.distplot(
        model[3], hist=False, color="r", label="Actual Value", ax=axes[cause - 1]
    )
    axes[cause - 1].set_title(name)
    sns.distplot(model[2], hist=False, color="b", label="Fitted Value", ax=ax1)

The model is really bad at predictions for most causes of death, so let's narrow it down a bit.

Now, let's take the top five models ranked by mean-squared-error and see how they fare using some bar plots.

In [119]:
models.sort(key=lambda x: x[1])
fig, axes = plt.subplots(5, figsize=(30, 40))
for x in range(5):
    name = models[x][0]
    mse = models[x][1]
    actual_vs_predicted = pd.DataFrame([models[x][2][2], models[x][2][3]]).T
    axes[x].set_title(name)
    actual_vs_predicted.columns = ["Predicted", "Actual"]
    actual_vs_predicted.sort_values(by="Predicted", inplace=True)
    actual_vs_predicted.plot(kind="bar", ax=axes[x])
In [120]:
fig, axes = plt.subplots(3,2, figsize=(30, 30))
for x in range(5):
    name = models[x][0]
    mse = models[x][1]
    actual_vs_predicted = pd.DataFrame([models[x][2][2], models[x][2][3]]).T
    axes.ravel()[x].set_title(name)
    actual_vs_predicted.columns = ["Predicted", "Actual"]
    actual_vs_predicted.sort_values(by="Predicted", inplace=True)
    actual_vs_predicted.plot(kind="scatter", x='Predicted', y='Actual',ax=axes.ravel()[x])

The model for eye disease is clearly inaccurate. Diseases relating to blood has too few data points and is also too inaccurate. The only reason the model for infectious and parasitic diseases has low MSE is because few people die from infectious diseases in developed countries. That is clearly not a good model either.

Congenital malformations is directly correlated with CMR for obvious reasons, so it is not interesting. The only one left that is somewhat interesting is diseases of the circulatory system. Let's do some more analysis on that.

In [121]:
display(models[2])
['Diseases of the circulatory system',
 0.01158372364476246,
 (1.4572018205979171,
  array([ 0.06097839, -0.04670699]),
  array([0.81425328, 0.90133845, 0.53914445, 0.56840262, 0.7025814 ,
         0.36899801, 0.49704923, 0.56184976, 0.72200838, 0.55433993,
         0.8862405 , 0.84420291, 0.4563368 , 0.55605172, 0.50027008,
         0.58473738, 0.49651417, 0.38878218, 0.61824463, 0.56133623,
         0.66455559, 0.53147336, 0.57068838, 0.60881048, 0.67198644,
         0.9411238 , 0.65503264, 0.6698129 , 0.74133971, 1.07797976,
         0.6476085 , 0.40456543, 0.45968747, 0.59838638, 0.35249979,
         0.41716189, 0.58513929, 0.97761413, 0.6942383 , 1.02330925,
         0.4762497 , 0.61602216, 0.54211881, 0.37986249, 0.53374805,
         0.47761484, 0.47570019, 0.68432329, 0.53618502, 0.60568221,
         0.64979416, 0.81959285, 0.87629052, 0.8966906 , 0.84603105,
         0.92771282, 0.63774494, 0.86043782, 0.58243312, 0.79754763,
         0.81711635, 1.11109557, 0.80379768, 0.59582106, 0.68669485,
         0.8140288 , 0.59257648, 0.99923773, 0.50242586, 0.55659363,
         0.5785617 , 0.69160118, 0.58649357, 0.54296214, 1.1131506 ,
         0.72593756, 0.55614739, 0.46982978, 0.92167412, 0.70961587,
         0.91136883, 0.42583398, 0.38228499, 0.49471431, 0.83630508,
         1.09303275, 0.45501612, 0.53162798, 0.56239656, 0.38489047,
         0.48247593, 0.70361165, 0.43843871, 0.5488142 , 0.50971251,
         1.07787174, 0.90601686, 0.93966934, 0.74241416, 0.98067893,
         0.46690957, 0.45453409, 0.55178391, 0.59710476, 0.8811169 ,
         0.48438655, 0.75006767, 0.61750098, 0.6702431 , 0.68500385,
         0.4863754 , 0.87932318, 0.85128903, 0.69048756, 0.64120953,
         0.52387482, 0.5048359 , 0.64662072, 0.68286349, 0.58394156,
         0.69114843, 0.61173977, 0.39537349, 0.65404497, 0.73192293,
         0.51520287, 1.12203386, 0.80713025]),
  15215    0.821342
  14919    0.855200
  15827    0.636941
  16225    0.578295
  15634    0.629686
             ...   
  15840    0.605804
  15720    0.711004
  16378    0.645405
  14862    1.044135
  15451    0.879081
  Name: DeathsPer100k, Length: 128, dtype: float64)]
In [124]:
model_data = models[2][2]
by_cause = for_mlm.query("Cause==9")
scale_factor = by_cause.DeathsPer100k.max() - by_cause.DeathsPer100k.min()
values = np.array([model_data[0], model_data[1][0], model_data[1][1]])
values *= scale_factor


def predict_point(row):
#     display(row)
    bmi, cmr = row.BMI, row.CMR
    return values[0] + values[1]*cmr + values[2]*bmi


print(
    "Equation for Number of Deaths/100k: y = "
    + str(values[0])
    + " + "
    + str(values[1])
    + "a + "
    + str(values[2])
    + "b"
)
Equation for Number of Deaths/100k: y = 482.042362253791 + 20.171649768061695a + -15.450673344144477b

Where a is CMR and b is BMI.

Let's try to use this model to predict a country's diseases of the circulatory system deaths per 100k as time progresses.

In [125]:
mses = []
for country in altogether.CountryCode.unique():
    df = altogether[(altogether.CountryCode==country)&(altogether.Cause==cause)]
    df['Predicted'] = df.apply(predict_point,axis=1)
    mses.append([country, mean_squared_error(df.Predicted, df.DeathsPer100k)])
mses.sort(key=lambda x: x[1])
mses = pd.DataFrame(mses,columns=['CountryCode', 'MSE'])
with pd.option_context("display.max_rows", None):
    display(mses)
CountryCode MSE
0 CYP 1.756630e+04
1 FIN 2.493592e+04
2 SWE 2.763274e+04
3 DEU 2.895980e+04
4 ISL 3.039817e+04
5 NOR 3.090837e+04
6 DNK 3.157552e+04
7 CHE 3.257105e+04
8 ARE 4.147307e+04
9 CZE 4.209307e+04
10 ESP 4.216831e+04
11 QAT 4.381259e+04
12 SVN 4.395115e+04
13 CAN 4.506245e+04
14 NLD 4.611671e+04
15 JPN 4.622472e+04
16 AUS 4.698551e+04
17 LTU 4.749413e+04
18 IRL 4.808255e+04
19 FRA 4.874446e+04
20 GBR 5.103221e+04
21 GRC 5.106424e+04
22 BEL 5.132704e+04
23 MLT 5.365683e+04
24 USA 5.399627e+04
25 AUT 5.401331e+04
26 SGP 5.481897e+04
27 LUX 5.643687e+04
28 ITA 5.648277e+04
29 BRN 5.809193e+04
30 SVK 5.834531e+04
31 NZL 5.853529e+04
32 ISR 5.860404e+04
33 SRB 6.186756e+04
34 HRV 6.190597e+04
35 HUN 6.687277e+04
36 LVA 6.919834e+04
37 EST 7.176207e+04
38 OMN 7.705389e+04
39 KWT 8.144864e+04
40 POL 8.156498e+04
41 BLR 8.218131e+04
42 CUB 9.071582e+04
43 BHR 9.100407e+04
44 MNE 9.130997e+04
45 PRT 9.278518e+04
46 ATG 9.427514e+04
47 TUR 1.081574e+05
48 KOR 1.120400e+05
49 CHL 1.175164e+05
50 UKR 1.183783e+05
51 CRI 1.202552e+05
52 BGR 1.329459e+05
53 GRD 1.470977e+05
54 MDV 1.519736e+05
55 BRB 1.599864e+05
56 RUS 1.661413e+05
57 BHS 1.694963e+05
58 LCA 1.730476e+05
59 NIC 1.848846e+05
60 ROU 1.870347e+05
61 MKD 1.887737e+05
62 MDA 1.896017e+05
63 URY 1.922692e+05
64 SYC 1.943611e+05
65 BIH 1.961738e+05
66 FJI 2.119188e+05
67 THA 2.222917e+05
68 MUS 2.284557e+05
69 VCT 2.580531e+05
70 ARG 2.720921e+05
71 CPV 2.813393e+05
72 JAM 2.931736e+05
73 VEN 2.934218e+05
74 BLZ 3.143005e+05
75 COL 3.220787e+05
76 PAN 3.292773e+05
77 LKA 3.720868e+05
78 TTO 3.889303e+05
79 ALB 3.996673e+05
80 SUR 4.848915e+05
81 PHL 5.225641e+05
82 MEX 5.267333e+05
83 SLV 5.557605e+05
84 GEO 5.856613e+05
85 PRY 6.015097e+05
86 KAZ 6.824372e+05
87 ARM 6.999442e+05
88 ECU 7.162406e+05
89 ZAF 7.385145e+05
90 GUY 7.635725e+05
91 EGY 8.473231e+05
92 BRA 9.865007e+05
93 KGZ 1.058868e+06
94 MNG 1.136744e+06
95 GTM 1.246683e+06
96 UZB 1.624622e+06
97 KIR 1.679935e+06
98 TKM 1.766550e+06
99 STP 2.094275e+06
100 AZE 2.266363e+06
In [126]:
fig, axes = plt.subplots(10,5,figsize=(30,100))
def predict_countries(countries, cause):
    for x in range(len(countries)):
        country = countries[x]
        ax = axes.ravel()[x]
        df = altogether[(altogether.CountryCode==country)&(altogether.Cause==cause)]
        df['Predicted'] = df.apply(predict_point,axis=1)
        sns.scatterplot(data=df, x='Year', y='DeathsPer100k',ax=ax)
        sns.scatterplot(data=df, x='Year', y='Predicted',ax=ax)
        ax.set_xticks(ax.get_xticks()[::5])
        ax.set_title(df.iloc[0]['Country'])
        ax.set_ylabel('DeathsPer100k')
        ax.legend(['Actual','Predicted'])
predict_countries(mses.loc[:49,'CountryCode'],9)

Nice! This model appears to work for most developed and in-transition countries. A few countries look to be off by a simple translation parallel to the y-axis. This may be perhaps due to a genetic predisposition to die less from heart-related disorders or some other external factor. Countries can calibrate this model for their own usage with a few years of mortality, CMR and BMI data.

I will now show you in detail some curated graphs that display this model's effectiveness.

In [105]:
curated = ['LTU','USA','NZL','POL','PRT','CUB']
fig, axes = plt.subplots(2,3,figsize=(30,10))
predict_countries(curated,9)

Lithuania: The prediction is off by a simple translation, but I would like to draw attention to around the 1993 mark. When the actual figures unexpectedly (at least against the year) increased, the predicted values also increased. This shows that something happened in Lithuania that year which shifted deaths due to cardiovascular causes and BMI or CMR.

Poland: Same as Lithuania but at the 1996 mark. I wonder if it's due to the Soviet Union disintegrating.

United States: Pretty good prediction for one of the world's most important countries, starts deviating a bit towards the end, perhaps as the country reaches stages of late-development.

New Zealand: This model in general works well for developed countries, which is expected as the model is based off their data. The other option (use all data) would have way too much variability due to developing countries, and is much worse than this.

Cuba: As Cuba transitions towards being less of a developing country and in general becoming more stable, the actual figures become more in line with the predicted ones.

Portugal: Same as Cuba. This model is quite effective during the transition stage between developing and developed.

In [146]:
def predict_country(cause):
    country = ""
    while country != "STOP":
        country = input("Enter alpha-3 country code: ").upper()
        if country in altogether.CountryCode.values:
            df = altogether[(altogether.CountryCode==country)&(altogether.Cause==cause)]
            df['Predicted'] = df.apply(predict_point,axis=1)
            display(df)
            plt.figure(figsize=(20,10))
            ax = sns.scatterplot(data=df, x='Year', y='DeathsPer100k')
            sns.scatterplot(data=df, x='Year', y='Predicted',ax=ax)
            ax.set_title(df.iloc[0]['Country'])
            ax.set_ylabel('DeathsPer100k')
            ax.legend(['Actual','Predicted'])
            plt.show()
# uncomment this to choose any country you want to be displayed
# predict_country(9)

Countries with exceptionally low rates of death due to mental disorders should monitor the mental health of children more closely, and diagnose mental disorders early on. This would ensure that appropriate care is given to them but the truth is that in the end many of them will die later on life nonetheless due to their condition. Therefore, when a country's rate of death due to mental disorders is high, it means that it has succeeded in given it's disabled children a few extra years of life.

Countries can use their leading cause of death to judge their current state of development. If it is infectious diseases, then they are far behind the rest of the world and should work on sanitation. If it is illnesses of the circulatory system, then they are average and should implement measures such as advocating for a healthy lifestyle. If it is neoplasms, then they are very developed and there is not much they can do about it unless cancer can be cured.

Countries in the transition phase of their development can use the model in question 4 to roughly determine the number of deaths due to illnesses of the circulatory system. They can then compare the predicted value with the actual value to see if they have a disproportionately high or low number of deaths due to that cause at their current stage of development. If it is overly high, then the country should work on equipping its public spaces with facilities necessary (e.g. AEDs) to prevent deaths due to heart disease from occurring.

Russia should really advocate for a healthier diet among its citizens.